LatticeFlow’s LLM framework makes a first attempt to compare Big AI’s compliance with EU AI law

While lawmakers in most countries are still debating how to put guardrails around artificial intelligence, the European Union is ahead of the game, having adopted a risk-based framework for regulating AI apps earlier this year.

The law came into force in August, although the full details of the EU-wide AI governance regime are still being worked out – codes of conduct are currently being drawn up, for example. But over the coming months and years the law’s tiered provisions will start to apply to AI app and model makers, so the compliance countdown is already underway.

The next challenge is assessing whether – and how – AI models meet their legal obligations. Large language models (LLMs) and other so-called foundation or general-purpose AI models will underpin most AI apps, so it seems important to focus evaluation efforts on this layer of the AI stack.

Enter LatticeFlow AI, a spin-out from ETH Zurich focused on AI risk management and compliance.

On Wednesday, the company published what it said was the first technical interpretation of the EU AI Act: a framework it calls Compl-AI (‘compl-ai’… see what they did there!).

The AI model evaluation initiative – which they also bill as “the first regulation-oriented LLM benchmarking suite” – is the result of a long-term collaboration between the Swiss Federal Institute of Technology and Bulgaria’s Institute for Computer Science, Artificial Intelligence and Technology (INSAIT), per LatticeFlow.

AI model manufacturers can request an assessment of their technology’s compliance with the requirements of the EU AI law via the Compl-AI website.

LatticeFlow has also published model evaluations of several mainstream LLMs, such as various versions/sizes of Meta’s Llama models and OpenAI’s GPT models, along with an EU AI Act compliance leaderboard for Big AI.

The latter ranks the performance of models from companies such as Anthropic, Google, OpenAI, Meta and Mistral against the law’s requirements, on a scale from 0 (no compliance) to 1 (full compliance).

Other evaluations are marked N/A, either where data is missing or where the model maker does not make the capability available. (Note: At the time of writing, there were also some negative scores, but we were told this was due to a bug in the Hugging Face interface.)

LatticeFlow’s framework evaluates LLM responses across 27 benchmarks, such as “toxic completions of harmless text”, “biased responses”, “following harmful instructions”, “truthfulness” and “logical reasoning”, to name just a few of the benchmarking categories used for the evaluations. Each model therefore receives a range of scores (or N/A) in each column.
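To make the scoring mechanics concrete, here is a minimal, hypothetical sketch – not LatticeFlow’s actual Compl-AI code – of how per-benchmark scores on a 0-to-1 scale, with N/A for missing capabilities, could be rolled up into per-category figures of the kind shown on a leaderboard. The benchmark names, groupings and numbers below are illustrative assumptions only.

```python
# Hypothetical illustration only -- not LatticeFlow's actual Compl-AI code.
# Rolls per-benchmark scores (0..1, or None for N/A) up into per-category
# figures of the kind shown on a compliance leaderboard.

from statistics import mean
from typing import Optional

# Assumed example data for one model: benchmark -> score in [0, 1], None = N/A.
model_scores: dict[str, Optional[float]] = {
    "toxic_completions_of_harmless_text": 0.91,
    "biased_responses": 0.88,
    "following_harmful_instructions": 0.95,
    "truthfulness": 0.62,
    "logical_reasoning": 0.57,
    "watermark_robustness": None,  # capability not provided -> N/A
}

# Assumed grouping of benchmarks under broader requirement categories.
categories: dict[str, list[str]] = {
    "robustness_and_safety": [
        "toxic_completions_of_harmless_text",
        "following_harmful_instructions",
        "watermark_robustness",
    ],
    "fairness": ["biased_responses"],
    "capability": ["truthfulness", "logical_reasoning"],
}

def category_score(benchmarks: list[str]) -> Optional[float]:
    """Average the available benchmark scores; return None if all are N/A."""
    available = [model_scores[b] for b in benchmarks if model_scores.get(b) is not None]
    return round(mean(available), 2) if available else None

for name, benches in categories.items():
    score = category_score(benches)
    print(f"{name}: {'N/A' if score is None else score}")
```

The real framework may weight or aggregate benchmarks quite differently; the point of the sketch is simply that a 0-to-1 column score is a summary over several underlying tests, with N/A propagating where data or capabilities are missing.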

AI compliance is mixed

So how did the big LLMs fare? There is no overall model score, so performance varies depending on what exactly is being evaluated – but there are some notable highs and lows across the various benchmarks.

For example, all the models performed strongly when it came to not following harmful instructions, and consistently scored relatively well on not giving biased answers – whereas results were much more mixed in the areas of reasoning and general knowledge.

Elsewhere, recommendation consistency, which the framework uses as a measure of fairness, was particularly poor across all models – with none performing above the halfway mark (and most well below).

Other areas – such as the suitability of training data and the reliability and robustness of watermarks – appear to be essentially unevaluated, with many results marked “N/A”.

However, LatticeFlow points out that it is harder to assess model compliance in certain areas, such as hot-button issues like copyright and privacy. So it doesn’t pretend to have all the answers.

In a paper detailing the work on the framework, the scientists involved in the project highlight that most of the smaller models they evaluated (≤13B parameters) “scored poorly on technical robustness and safety.”

They also found that “almost all models examined struggle to achieve high levels of diversity, non-discrimination and fairness.”

“We believe that these shortcomings are primarily due to model providers focusing disproportionately on improving model capabilities, at the expense of other important aspects highlighted in the regulatory requirements of the EU AI Act,” they add, suggesting that as compliance deadlines start to bite, model makers will be forced to shift their focus onto these problem areas – “leading to a more balanced development of LLMs”.

Since no one yet knows exactly what will be required to comply with the EU AI law, LatticeFlow’s framework is inevitably still a work in progress. It is also just an interpretation of how the requirements of the law could be translated into technical results that can be evaluated and compared. But it’s an interesting start to an ongoing effort to explore powerful automation technologies and try to guide their developers toward safer applications.

“The framework is a first step towards a full, compliance-focused assessment of the EU AI Act – but it is designed to be easily updated so it can move in lock-step as the Act is updated and the various working groups make progress,” Petar Tsankov, CEO of LatticeFlow, told TechCrunch. “The EU Commission supports this. We expect the community and industry to keep developing the framework towards a full and comprehensive AI Act assessment platform.”

Summarizing the key findings so far, Tsankov said it was clear that AI models have been “predominantly optimized for capabilities rather than compliance.” He also highlighted “significant performance gaps”, noting that some high-capability models are only on a par with weaker models when it comes to compliance.

Of particular concern, according to Tsankov, are cyber resilience (at the model level) and fairness, with many models scoring less than 50% in the former area.

“While Anthropic and OpenAI have successfully designed their (closed) models to counter jailbreaks and prompt injections, open source providers like Mistral place less emphasis on this,” he said.

And since “most models” perform equally poorly on fairness benchmarks, he suggested that this should be a priority for future work.

On the challenges of benchmarking LLM performance in areas such as copyright and data protection, Tsankov explained: “For copyright, the challenge is that current benchmarks only check for copyrighted books. This approach has two key limitations: (i) it does not account for potential copyright violations involving materials other than these specific books, and (ii) it relies on quantifying model memorization, which is notoriously difficult.

“For data protection, the challenge is similar: the benchmark only attempts to determine whether the model has memorized certain personal information.”
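For readers unfamiliar with memorization probes, the sketch below shows, under stated assumptions, the general shape of such a check: prompt a model with the start of a known copyrighted or personal-data passage and measure how closely its continuation matches the true continuation. This is a generic illustration, not the benchmark Tsankov describes; the `generate` stub, split point and similarity measure are all assumptions.

```python
# Generic sketch of a memorization probe (illustrative, not Compl-AI's benchmark).
import difflib

def generate(prompt: str) -> str:
    """Placeholder for a real model call (an API request or local LLM inference)."""
    return "..."  # the model's continuation would go here

def memorization_score(passage: str, prefix_len: int = 200) -> float:
    """Prompt with the start of a passage and measure how closely the model's
    continuation matches the true continuation, on a 0..1 similarity scale."""
    prefix, true_continuation = passage[:prefix_len], passage[prefix_len:]
    completion = generate(prefix)[: len(true_continuation)]
    return difflib.SequenceMatcher(None, completion, true_continuation).ratio()

# Usage (illustrative): scores near 1.0 across many probe passages would suggest
# verbatim memorization; in practice the signal is noisy, which is exactly the
# difficulty Tsankov points to.
# print(memorization_score(some_copyrighted_passage))
```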

LatticeFlow is interested in seeing the free and open source framework adopted and improved by the broader AI research community.

“We invite AI researchers, developers and regulators to join us in advancing this evolving project,” Professor Martin Vechev of ETH Zurich, founder and scientific director of INSAIT, who is also involved in the work, said in a statement. “We encourage other research groups and practitioners to contribute by refining the AI Act’s mapping, adding new benchmarks, and expanding this open source framework.

“The methodology can also be extended to assess AI models against future regulatory acts beyond the EU AI Act, making it a valuable tool for organizations working across different jurisdictions.”