Dealing with foundation models, data protection, and copyright matters in the EU AI Act

Sergio Maldonado
PrivacyCloud
16 min read · Mar 4, 2024


The following is a transcript of our recent interview with MEP Dragos Tudorache on Masters of Privacy. The original recording can be found on the podcast’s website, as well as on your favorite podcasting feed.

Sergio Maldonado: Dragos, thanks for joining me.

Dragos Tudorache: Thank you for inviting me.

S.M. I know you had, I believe it was, a 22-hour negotiation on December 6th and 7th. I know foundation models were controversial. How did you get to a compromise?

D.T. Well, it's true that going into those negotiations, we knew they had to be the last stand, in a way, simply because all three institutional players at the table, the Commission, the Council and the Parliament, wanted to wrap up the negotiations by the end of the year. So all three of us entered those negotiations on the 6th with the idea that we would finish. But as we were going in, we knew that there would be three big, as you call them, points of controversy.

One was the issue of biometrics and, more broadly, the issue of access by law enforcement to certain parts of the AI technology that they use. Another was related to governance and, somehow linked to that, the issue of foundation models. And we knew it would be difficult, first of all, for a formal, let's say institutional, reason: we in Parliament were the only ones of the three institutions that had actually put forward something in the text on foundation models, a regime crafted for them, whereas the Council and the Commission hadn't.

That alone makes a negotiation very difficult, because you start from three different corners of the room. But beyond the institutional aspects, we knew it would also be difficult in terms of substance, first of all because the issue was quite fluid for many of those dealing with foundation models: putting a finger on what they actually are, what kind of obligations you need to put in place, knowing that it's a very fluid, quickly advancing technology. And last but not least, particularly important, the stakes were enormous.

The interests around the issue of foundation models and how to deal with them were enormous. That was felt in the positions of some member states, which had been expressed already prior to the negotiation. It is now known that there were voices, particularly in France and Germany, advocating very strongly, some of them even for no obligations whatsoever, no regime whatsoever, for foundation models, trying to shift responsibility down the value chain to users or to developers of systems built on top of the models, and certainly resisting any idea of accountability for the developers of the models themselves.

And we in Parliament also knew, even though we had that regime in our mandate, that we had to keep up with the evolution of the technology since the moment our own text was drafted in the spring of 2023, because in the meantime we had seen quite rapid development. We saw that there are indeed many nuances between different models and how they function, and that we had to find solid, objective criteria to differentiate, to make sure that we allocate responsibilities where they are needed. We in Parliament also did not want to unnecessarily cast too broad a net and catch in this net of obligations developers that did not deserve to be there.

So again, we started off knowing there was going to be a tough discussion, and it did indeed prove to be a difficult one. I think we spent about half of those 22 hours of negotiations on this issue alone. Of course I am subjective when I say this, but I believe we managed to find that element of balance, that equilibrium between the different interests at play, and we provided a system that makes sense.

S.M. Very good, thank you. So, the way it stands today, the law seems very technical to me. It gets very specific when it comes to differentiating what is high risk first, and then, within foundation models, the fact that you can build a system on top, and the fact that the models themselves are going to be regulated by the Commission through this AI body, what's the name of that?

D.T. The AI office.

S.M. Thank you, the AI Office, and there's going to be the Board across the member states for the systems. That much I gather, but there are so many intricacies that now people are wondering, what's the difference? This is a very specific question: between building an AI system on top of a so-called closed AI model, like OpenAI's, and building on top of what has so far been called an open source model, such as Mistral's, is there a major difference for a European startup building a system on top of either one of them, in terms of paperwork and what is perceived as regulatory burden? Do you think there's going to be an advantage for those that build on top of open source?

D.T. Well, I think it's first and foremost important to make a very clear distinction between models and systems. I know that it is indeed technical, and it's very difficult for some to feel the difference, but there is a difference. What we are regulating in the special regime that we've elaborated, and which was part of this very difficult negotiation, is the responsibility for the models. Why is that? And what do we mean by models? That's why they are called foundation models: they are basically those AI models that are the point of origin, let's say, which you can then build on top of or reorient towards specific uses afterwards, and that is a system.

In other words, you take a model, and then out of that model you develop a system that can be allocated to a specific task, to do a certain thing. The model is broad, capable of almost anything; we know that, and it's one of the characteristics of these foundation models, their versatility in terms of use. It is then for other providers that want to develop systems on the basis of those models to reshape them, to retrain them, to reorient them towards specific purposes.

And then, once you are one of those providers of a system, the rules in the Act apply to you depending on the purpose that you allocate to that particular system. So that's an important distinction to make. Now, coming back to models.

The models: again, as I said earlier, we realized that we also need to differentiate within this family of models. Why? Because even though they all share this characteristic of being trained on large amounts of data and having this versatility, some are more prone to systemic impact and systemic risk than others. And therefore we thought it made sense to allocate responsibilities depending on the likelihood of risks and harms arising out of these models.

The question that came very quickly afterwards, once we said we wanted to differentiate, was: what criteria do you use to make the differentiation? What actually makes certain models more potent than others? We played with different options, and in fact for quite a while, because people think that everything happened during that night. No, in fact for three months prior to that, already as of August and September, we were each trying to find the best ways to provide that differentiation between the two tiers.

And ultimately, out of that analysis, based on evidence that we had gathered from the scientific communities that were crunching the numbers on this, the one criterion that came out as the sole objective one, the one that for now gives us an element of separation between models, is compute power. We realize that this cannot be final, because the tendency of the technology will actually be to reduce over time the compute needed to train certain algorithms, and also because there might be other elements that in the future prove characteristic of what makes a model more prone to risk or harm than others.

So we did two things. First, we said that, for now, we are going to use compute power as the differentiator between the top-tier models and the lower-tier models. But we also introduced an element of flexibility and said that the AI Office not only can but should, in the future, continuously assess, reassess and modify these technical criteria depending on the evolution of the technology, and will do so with assistance from the scientific panel attached to the Office, which will play quite an important role in making sure that the Office remains constantly in touch with the realities of how this technology continues to evolve.

So this is how we've provided the differentiation, and we've left quite a lot of flexibility to the future regulator, the Office, which has sole competence to deal with the implementation and enforcement of the rules related to foundation models. The one actor, the one regulator for providers of foundation models, will be the Office, not the national authorities, in recognition of the very high impact of these models, and also of the power of the companies sitting behind them, and therefore of the need to have one sole European-level actor interacting with these providers.

So we have this exclusive competence for the Office, we have the two tiers, and we have a different set of requirements for the lower tier and the upper tier. The question on open source, which we also debated quite a lot, was: how do we deal with open source? First of all, how do we understand what indeed is open source? And we've provided a definition.

S.M. I've seen the definition in the law.

D.T. That is what, in the scope of the Act, would be considered open source. And rightly so, because there are many out there who advertise themselves as open source, but in fact they are not, for example, releasing the model weights, or they are not releasing other elements which would in fact make them open source. Even if they don't work with APIs and access points like the closed systems, still, by not releasing all of the data, they do not qualify as open source.

But if you do qualify as open source, so if you do respect the criteria in the text and you are open source, then you are exempt from some of the obligations, particularly those related to documentation, that the providers of proprietary, closed models would otherwise have. There is one set of obligations that would still apply even to open source, which are the ones related to copyright, because there we thought that no matter whether you are open or closed, you still have an obligation of transparency for the copyrighted material that you use in training. So that is the one obligation common to open source and closed, but otherwise, if you are open source and you are in the lower tier of foundation models, then you do not have to abide by the rest of the obligations that the closed models have.

And if you are an open source model in the first tier, so one of the big systemic-impact models, it doesn't matter whether you are open source or not: you would still have to respect the specific obligations for the top tier, which are risk management, incident reporting, red teaming, all of the obligations that are basically linked to the higher level of responsibility that you should have as a provider of such a model.

S.M. Very good. We were all wondering how they were going to manage that. There is sometimes this skepticism of EU law, and from the US perspective there is always going to be the criticism that the EU regulates in a way that protects the local industry. Since the larger players are in the US, people were wondering how they were going to draw the line in a way that affects the Anthropics and OpenAIs and so on but doesn't affect the Europeans; but there was the open source element, and there was the other element, computing power.

So I'm also wondering: one key element of what we have called open source so far, even though I do not think it will fall within the definition, is that you can deploy locally. I think that's a huge advantage, that you can take one of these models and bring it to the edge, or bring it to your own cloud, which is something you cannot do with an OpenAI model. I know many startups are looking at this, deploying on other clouds in Europe. What's the advantage for the AI system in terms of paperwork, in terms of compliance? Is there an advantage versus deploying a GPT, one of these tiny GPTs built on top of the OpenAI store?

D.T. Well, if I understood the question correctly: first of all, if you want to develop a system on top of a model, it doesn't matter whether the model comes from the US or from Europe, or whether it's open source or not. You develop a system; you are out of that scope, the whole chapter that refers to models does not apply to you, and you don't have to worry about that.

What you have to worry about is this: if you are developing a general-purpose system, and there again it doesn't matter whether you are proprietary or not, you will have obligations that are strictly linked to the value chain. In other words, if you develop a general-purpose system and put it on the market, you have obligations only insofar as others take your system and turn it into something that might become high risk; in order for them to comply with the high-risk obligations, they will need things from you, which you will need to pass on to them. So that is your obligation: keeping that kind of documentation and making it available, of course based on agreements, on contracts that have to be fair. Basically, you will just have to pass on the data to others who are using your system and turning it into something whose scope falls under the contexts listed as high risk. That is the only obligation you have.

So I think that's the only thing developers of general-purpose systems have to worry about. If, on top of a model, you develop a system that is already directed towards a particular purpose, you're developing a recruitment tool, an education tool, a banking or insurance profiling tool, so anything that comes directly under high risk, then you would have those obligations.

And again, whether you are using a model that you have installed in the cloud, or you've taken it to the edge, or you're keeping it in your garage, it won't matter, it doesn't make a difference. In fact, the only question you should ask yourself as a provider, as a developer of a system, is whether you are deploying it in a context that is high risk. If not, then you're not bothered by this regulation at all.

You can do whatever you want to do; the Act is indifferent to you.

S.M. To rephrase that, and to make sure I can simplify it: since these days most people are developing on top of generative AI, or foundation models, you will have at the very core the foundation model, which is a general-purpose model. And that could be high impact or not.

So they may be using lots of computing power and have, again, potentially… And it's the providers that have the obligations, not you taking the model. Very good. And then they need to give you their paperwork so you can pass it on, if you are developing a general-purpose system on top of such a general-purpose model. Then, once you build the general-purpose system, someone else is going to be doing something specific. For example, a chatbot could be used for so many things. If it's used for human resources, then it may be high risk.

Therefore they then have their own obligations, but you do your part simply by following the chain of custody of all of that accountability. Okay. And then, if you're sitting directly on top of a model, regardless of whether it's open or closed, and you're doing something very specific that is not high risk, because it's not in that list, then you're okay. You simply need to make sure that you look at the documentation provided to you by whoever trained the model, whoever developed the model. Yeah. And at the same time, if you're adding content and retraining, or sort of adding different weights, of course you need to make sure that you respect copyright.

So something else I was wondering is: why do we have so many redundancies in the law? I understand that we need to be very careful not to step on the GDPR. I understand we're going to be careful not to step on copyright law. Since copyright law already contained the exception for text and data mining, and already contained an opt-out, why did we need to make sure that people comply with copyright if we have a copyright law that should already cover this?

And second, or rather second of three: in the context of data protection, the law makes clear that it will not affect automated decision-making under Article 22 of the GDPR, and it's clear that all of the GDPR data subject rights prevail. The only space where I can see an interference with the GDPR would be in the opposite direction. Since we need training data, quality data, to make AI better, I guess expanding the exceptions for scientific research and so on would be a way to have an impact on the GDPR, but it goes in the opposite direction, making it more flexible so that we can use more data. And the third one, the third scenario, is product liability. We already had product liability directives, so why did we have so many redundancies if it was not to change those laws?

D.T. So let's start with copyright. I don't think that what we have drafted in the text is a repetition of, or a redundancy with, copyright law. What we wanted to make sure of, in the context of AI, a context that did not exist as such when the copyright legislation was drafted and when the data mining provision was included in it, was that those opt-out rights, as well as the rights of copyright holders in general, are effective. How is even the opt-out effective if you have no idea whether your work has been used in the training of an algorithm?

That was essentially the question that we asked ourselves, trying to square this equation, this dilemma. There were tendencies inside Parliament, even at the time when we were crafting our mandate, to actually write even more into this text of the AI Act in terms of copyright, recognizing that we are now in a different world with AI and that this requires more specificity. I resisted that, exactly because I did not want to create redundancies and I did not want to reinvent copyright through the AI Act. But I did want, and I thought this was the minimum duty we had with this text, to make sure that those rights under the copyright legislation are effective and can be effectively exercised by the holders of those rights. Hence the provisions. So what have we done? We've introduced this sort of reminder, this redundancy if you want, to the whole opt-out principle. And then we also specified the obligation of transparency for the models, so that if I am a holder of a right, I know whether my work was used in the training of an algorithm, and if I feel that somehow that touches upon my rights, I can then use the copyright legislation to assert my rights if I want to.

So this is how we have approached copyright. With the GDPR we had a bit of a similar story, because there were tendencies inside the negotiating team to maybe tweak the GDPR here and there with the AI Act. But, again, we did not want to, because, listen, we are not reinventing the GDPR with the AI Act.

If someone wants to reinvent the GDPR, fine, but it should be done standalone, as a revision of the GDPR, not via the AI Act. But it's true, as you rightly said, that there were some points where we thought we needed to give some extra flexibility without going outside of the GDPR. The GDPR still remains the gold standard, but there are some points where we needed to push a little bit further, not bending the rules, but basically pushing them a bit to make sure that we have the flexibility to allow for testing and for research, including on issues that are very important, for example, bias.

We had a lot of discussion on bias, because we said: listen, if you put an obligation on providers of high-risk models to be actively monitoring for bias, which is a very important thing when you deal with algorithms and their capability, unfortunately, to actually accelerate certain processes, then you need to be able to make some use of personal data. And that's why we have been quite specific there on exactly how you do that.

So for the purpose of monitoring bias, you can work with personal data, of course with certain guarantees that we've introduced in the text. You can see it as a redundancy, but it is there to provide exactly that element of flexibility that I mentioned earlier. Same thing, for example, in the sandboxes.

We recognized that in the sandbox, when you are in that sort of research mode, in a trial-and-error mode, where you are testing, maybe making mistakes, correcting course, with a regulator next to you trying to test all the assumptions in your model, in your system, you need some flexibility in terms of the GDPR regime to make sure that you use datasets that are as real as possible, as effectively as possible, but without breaking them. So this is how we've played with it. And I think there was a third element, related to safety.

S.M. Product liability.

D.T. Sorry, product liability. Well, there was also this talk about a directive coming on top of it, which would deal with liability. Eventually the Commission considered that it was no longer necessary, that proposal was withdrawn, and I think it's good that they withdrew it, because I think it's good if we have only the Act as the basis to deal with it. And I think that what we have right now in the text is enough in that respect. I wouldn't add anything on top of it right now.

S.M. Dragos, many thanks.


Sergio Maldonado
PrivacyCloud

Dual-admitted lawyer. LLM (IT & Internet law), Lecturer on ePrivacy and GDPR (IE Business School). Author. Founder: PrivacyCloud, Sweetspot, Divisadero/Merkle.