SPY Lab researchers first ever to peek into ChatGPT’s black box
In a world-first, researchers from the SPY Lab led by Professor Florian Tramèr, along with collaborators, have succeeded in extracting secret information about the large language model behind ChatGPT. The team responsibly disclosed the results of their “model stealing attack” to OpenAI. Following the disclosure, the company immediately implemented countermeasures to protect the model.
Researchers from the group of Professor Florian Tramèr have devised and executed an inexpensive attack on production-level large language models (LLMs) using the models' publicly available application programming interfaces (APIs), tools commonly used by software developers to communicate with applications. The successful attack shows that popular chatbots like ChatGPT are susceptible to revealing secret information about the underlying models’ parameters. The work was done in collaboration with researchers at Google DeepMind, the University of Washington, UC Berkeley and McGill University.
“Our work represents the first successful attempt at learning some information about the parameters of an LLM chatbot”, Tramèr said. Although the information his team gained from the attack was limited, Tramèr points out that future attacks of this kind could be more sophisticated and therefore more dangerous.
Companies such as OpenAI, Anthropic and Google reveal essentially nothing about the large language models they make available to the public. It was this secrecy surrounding popular online tools such as ChatGPT that motivated Tramèr and his co-workers to attempt what experts call “model stealing attacks” on them.
A new type of attack on LLMs
In the past, model stealing attacks were essentially of two types: in the first type of attack, which Tramèr and co-authors introduced eight years ago, the attacker uses the outputs of API queries to train a local proxy model that mimics the target model's behavior. This type of attack works well, but so far no one has tried it on models similar in size to ChatGPT. While easy to implement, such an attack won't reveal anything about the target model's exact parameters.
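The following toy sketch illustrates the idea behind this first, distillation-style type of attack. Everything in it is a stand-in: a local scikit-learn classifier plays the role of the black-box API, and random vectors play the attacker's queries; a real attack would instead send its queries to a remote service.

```python
# Toy illustration of a distillation-style model stealing attack.
# All models, data and sizes here are illustrative stand-ins.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# The "target": a model the attacker cannot inspect, only query.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
target = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0).fit(X, y)

# The attacker sends inputs of their own choosing and records the returned answers.
rng = np.random.default_rng(1)
X_queries = rng.normal(size=(5000, 20))
y_from_api = target.predict(X_queries)

# A local proxy trained on those query/answer pairs mimics the target's behavior
# without ever seeing its parameters or training data.
proxy = LogisticRegression(max_iter=1000).fit(X_queries, y_from_api)
agreement = (proxy.predict(X) == target.predict(X)).mean()
print(f"proxy matches the target on {agreement:.0%} of the target's own data")
```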
The second type of attack is much more ambitious in that it aims to recover the exact parameters of a model. Such attacks are far more expensive and have only been demonstrated against very small models.
In their latest attack, Tramèr and his team aimed for a middle ground: they asked themselves whether they could recover some partial information about the model's parameters without trying to “steal” the entire model.
Tramèr explains the details of the attack as follows: “Our attack essentially recovers the last ‘layer’ of the target model, which is the mapping that the LLM applies to its internal state to produce the next word to be predicted. This represents a very small fraction of the total number of parameters of the model, as modern LLMs can have over a hundred layers. However, in a typical LLM architecture, all these layers are the same size. So, by recovering the last layer, we learn how ‘wide’ the model is, meaning how many weights each of the model’s layers has. And in turn, this tells us something about the overall model size, because a model's width and depth typically grow proportionally.”
All the team needed was some simple linear algebra and features publicly available in OpenAI's API, which they used to accelerate the attack. Overall, the attack cost amounted to just 800 US dollars in queries to ChatGPT.
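The core of that linear algebra can be sketched in a few lines. The vocabulary size, hidden width and query_logits helper below are illustrative assumptions, not OpenAI's actual model: because every logit vector returned by the API is the model's internal state multiplied by the final layer, a stack of such vectors has rank equal to the model's hidden width, which a singular value decomposition reveals.

```python
# Minimal sketch of the rank-based idea, with a simulated model standing in for
# the real API. The sizes and W_last below are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_width = 1000, 64                      # toy sizes
W_last = rng.normal(size=(vocab_size, hidden_width))     # the secret final layer

def query_logits() -> np.ndarray:
    """Hypothetical API call: returns the full logit vector for one prompt."""
    h = rng.normal(size=hidden_width)                    # the model's internal state
    return W_last @ h                                    # logits = final layer applied to state

# Stack logit vectors from many different prompts into one matrix.
Q = np.stack([query_logits() for _ in range(256)])       # shape: (256, vocab_size)

# Every row lies in the column space of W_last, so the matrix has rank equal to
# the hidden width; the singular values collapse to ~0 beyond that point.
s = np.linalg.svd(Q, compute_uv=False)
recovered_width = int(np.sum(s > s[0] * 1e-8))
print("recovered hidden width:", recovered_width)        # prints 64
```

In the same spirit, the top singular directions of such a matrix span the final layer itself, up to an invertible transformation, which is the partial information the attack recovers.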
The researchers disclosed their findings to OpenAI, which confirmed that the extracted parameters were correct. The company went on to make changes to its API to render the attack more expensive, albeit not impossible.
The researchers will present their findings at the upcoming International Conference on Machine Learning (ICML) which will take place from July 21 through 27 in Vienna, Austria.
Reference
Nicholas Carlini, Daniel Paleka, Krishnamurthy (Dj) Dvijotham, Thomas Steinke, Jonathan Hayase, A. Feder Cooper, Katherine Lee, Matthew Jagielski, Milad Nasr, Arthur Conmy, Eric Wallace, David Rolnick, Florian Tramèr: Stealing Part of a Production Language Model. International Conference on Machine Learning (ICML), Vienna, Austria, 2024.