How to Train an LLM with Privacy by Design
Large language models like ChatGPT took the world by storm in late 2023, starting a frenzy of AI-powered tools that have the potential to completely overhaul the digital economy in the coming years. People had never seen anything like it, heralding the technology as either the ultimate productivity tool or a straight-up replacement for jobs like copywriters and designers altogether.
ChatGPT reached 100 million users within two months of its release and was riding high in early 2023, but over the past year or so the cracks have started showing. The same goes for LLMs other Big Tech competitors have put on the market to stake their own claim to AI, such as Google’s Gemini and Meta’s Llama.
The Problem with Large Language Models
First was the realization that OpenAI and other Big Tech companies had taken data scraping to its extreme, utilizing every ounce of data they could find to train their models. This, unsurprisingly, has led to a lengthy list of lawsuits the company faces, along with even more claims that the technology is nothing but a plagiarism machine.
Despite the cautiousness a company facing this level of criticism should be exercising, OpenAI is not making its life easier, as several days ago it released its AI assistant, Sky, which sounds virtually identical to Scarlett Johansson. This comes after the company approached the actress with this offer and was rebuffed, another incident in a long line of instances where OpenAI simply did what it wanted to do.
ChatGPT, in large part, has demonstrated how an LLM should not operate, despite its impressiveness on first sight. Because LLMs carry such risk, the failure to build an ethical model, one that is reliable, responsible, and secure, can carry severe consequences.
Large language models greatly increase an organization’s attack surface, arm bad actors with more tools than ever, and necessitate in-depth and continuous training to ensure staff do not run afoul of best data privacy and security practices. Those are just part of the risk profile organizations using LLMs accept, let alone the added risks of developing or deploying these systems.
ChatGPT seems to have remained unbothered by that, given its evident failure to adhere to GDPR principles of privacy by design and privacy by default. Such is the official position of Italy’s Data Protection Authority, which deems the LLM noncompliant.
Part of the public debate is the opaqueness of how these systems work, which OpenAI representatives have not clarified, and in some cases, have obfuscated by noting even they cannot fully explain how the systems works.
For consumers wishing to exercise their data rights, that leaves them out of luck. As is, it seems virtually impossible–due to cost and sheer feasibility–to remove data a model has already scraped and been trained with, meaning GDPR-given rights cannot be fulfilled by LLMs.
While ChatGPT and other AI-powered systems are now offering methods to opt-out from the use of a user’s content to train AI, the horse is out of the barn for many of these systems that have already scraped heaps of data. The fact that these sensitive systems are not opt-in, and require a user to go looking for them to be able to opt-out are additional problems undercutting the public stance these companies are taking on AI training.
They are saying users have rights, but contradicting those statements with their actions.
An Alternative LLM Training Option
But not every LLM needs to go this route. In fact, 273 Ventures, a Chicago-based legal tech consultancy startup, released an LLM earlier this year called KL3M, proves that fact. KL3M was trained exclusively on materials that did not violate copyright law or require sensitive personal information.
The amount of data that went into the system? Roughly 350 billion units of data. The exact size of the training set used for ChatGPT is unknown, but experts have put the number at several trillion units.
This reinforces the age-old “quality vs quantity” debate, as how an LLM is trained will always come down to the data used for the model. Using too much risks degrading the overall quality of the dataset, a reality OpenAI failed to understand when developing ChatGPT.
Choosing which data to train on then depends on dividing up types of data on the internet: that protected by copyright or intellectual property law, more nebulous publicly available data, and public domain data.
Copyright, Public Domain, and Publicly Available Data
Data in the public domain is the most straightforward, since it is free and available for anyone to use. Given the massive scope of public domain (even Mickey Mouse is there now!), there should be more than enough to establish a good baseline for any generalist LLM.
Copyrighted or IP-covered materials are also straightforward, although less advantageous for those wishing to train an LLM. For this data, developers will need to license the specific data they’d like to train their system on, meaning yes, they will need to break out the checkbook. For LLMs that are looking to specialize, licensing the proper material will be key in obtaining the high-quality data needed to produce useful outputs.
A healthcare LLM will need to license medical documents and data from healthcare providers, a science LLM will need to license academic research and papers, and so on. The value in licensing data is not only ensuring you’re dealing with high quality data, but in getting that data from the source, you allow more privacy by design principles to seep into the final LLM.
If you are licensing healthcare data from a medical research firm, for example, that firm will be able to properly anonymous or pseudonymize any personal data in the dataset before handing it over for training.
If an LLM indiscriminately scraps up data–even if it is applying similar techniques–it will not be doing so with the accuracy and precision of treating the data at the source. LLMs going the extra mile and pseudonymizing sensitive data from medical records, legal documents, and public forums will likely be doing so the same way, because its most cost-effective and the system can’t distinguish between types of data, dramatically lessening the overall data security in the system.
This takes us to publicly available data, the most dangerous kind for training LLMs. Why is it dangerous? Because unlike information from New York Times articles, Getty Image collections, or Sarah Silverman’s comedy routine, the risks here are not about breaking copyright, but doing privacy harm.
Think of every internet forum you’ve ever been a part of: that is all publicly available data. It’s everything that is on the internet and available for anyone browsing to stumble across, whether a Medium article, or a video game mod hosted on Nexus Mods, or your cousin’s design portfolio. None of that is public domain because no one would have thought they needed to protect those creations … until LLMs came along.
Every word most of us have ever written online has likely been used to train the current crop of Big Tech LLMs, all done without consent or regard for privacy. That is the antithesis of ethical product development.
So how can organizations looking to train ethical LLMs in the future proceed?
Steps to Ethically Train a Large Language Model
- Choose data wisely
Any training data must suit the purpose, and the quality and relevance of the data must be the primary consideration. Put garbage in and you’ll get garbage out, having only wasted everyone’s time.
- Avoid using personal information
This is particularly important if you are scraping publicly available websites like forums where people have likely shared personal information. At the very least, there needs to be data anonymization or pseudonymization used for any sources that are not from the public domain, and thus free of privacy risk.
- Show users your sources
Transparency is key, and a considerable part of what has been lacking with ChatGPT and other notable LLMs. Make sure users know what data the system was trained on, listing sources of where the data came from and how the collective dataset is intended to influence outputs and how it historically has. Billions of data units is extensive, and you don't need to exhaustively list every one, but the bulk of where the training data came from should be clear.
LLMs are not magic machines spitting out novelties, and the more users understand that, the better they will be able to interact with one.
- Keep feedback loops open
An LLM should never stop learning, whether it's through reinforcement learning from human feedback (RLHF) or running regular audits on the data already used for training. That might sound expensive, and it is, but these systems cannot be black boxes once they go out into the public. By keeping feedback loops open, training can be adjusted incrementally over time rather than having to redo it at-scale if things go wrong.
ChatGPT of course works through RLHF, but the more data the system trains on, the less effective the feedback; this is another reason why data quality must trump data quantity.
- Flag unusual or malicious behavior
As noted before, LLMs are tremendously capable systems, and bad actors have already adopted them for use in scams, phishing schemes, and other ploys. Developers must be aware that these things happen and flag any attempts to manipulate or jailbreak models. The last thing the world needs is another Microsoft Tay controversy, where users were able to turn the chatbot into a cesspit of racism and homophobia within days.
********************************************
If LLMs are going to usher in the next wave of AI and technological innovation, they need to do so under moral guidance. Any company developing one has a duty to respect and prioritize privacy and safety, and if we collectively cannot guarantee those things, then maybe we aren’t ready for the technology.
Looking for a privacy solution for your organization's data and AI governance challenges? Try MineOS for free.