Keep PII and PHI out of public LLMs
Many folks are adamant that they don't trust third-party LLM providers not to misuse their data. One solution is an on-premises language model, but that has its own risks and headaches.
Another possible solution is to alter the data before presenting it to an LLM for analysis, then restore the original text afterward.
This is my proposed solution: PIIGPT
PIIGPT provides a PIIScrubber that analyzes text for PII. The scrubber consists of three components:
- Analyzer — Tool for detecting PII and PHI
- Anonymizer — Tool for replacing PII and PHI with random characters
- Cache — A temporary storage to link the anonymous text with the original text
Analyzer:
Currently it uses the Azure AI Text Analytics tool by default, but it is extensible, so any analyzer can be plugged in. I plan to add more analyzers in the future, including ones that don't require an API. That said, Azure's Text Analytics tool does a phenomenal job.
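As a sketch of what a no-API analyzer might look like, here is a minimal regex-based one. The `RegexAnalyzer` class, the `Entity` record, and its fields are assumptions for illustration, not the project's actual interface:

```python
import re
from dataclasses import dataclass

# Hypothetical entity record mirroring the fields shown in the sample
# output below; the real project may use a different structure.
@dataclass
class Entity:
    text: str
    category: str
    offset: int
    length: int

class RegexAnalyzer:
    """Offline analyzer: detects entities with regexes instead of a cloud API."""

    PATTERNS = {
        "PhoneNumber": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
        "Email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    }

    def analyze(self, text: str) -> list[Entity]:
        entities = []
        for category, pattern in self.PATTERNS.items():
            for m in pattern.finditer(text):
                entities.append(Entity(m.group(), category, m.start(), len(m.group())))
        return entities
```

An analyzer like this trades Azure's accuracy for zero network dependency, which may be the right call for strictly offline environments.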
Anonymizer:
The anonymizer is configurable through a config.toml file. If a regex pattern is specified for a particular entity, the anonymizer generates an anonymous replacement matching that pattern. If no pattern is specified for an entity, a random character sequence is used instead.
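For example, a config.toml along these lines would keep replacements shaped like the originals. The section and key names here are illustrative; check the repository for the exact schema:

```toml
# Regex patterns the anonymizer uses to generate replacement values.
# Entities without an entry get a random character sequence instead.
[patterns]
PhoneNumber = '\d{3}-\d{3}-\d{4}'
Email = '[a-z]{8}@example\.com'
```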
The anonymizer can also be replaced by a completely different anonymizing tool.
CacheProvider:
The default CacheProvider uses a dictionary keyed by the anonymous text, with the original text as the value. A background thread scans for entries past their time-to-live (TTL) and destroys expired key/value pairs, so the data does not remain in memory longer than the specified TTL.
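A minimal sketch of that idea, assuming a simple dict-backed provider (the class name, method names, and defaults here are illustrative, not the project's exact implementation):

```python
import threading
import time

class DictCacheProvider:
    """Maps anonymous tokens back to original text, expiring entries after a TTL."""

    def __init__(self, ttl_seconds: float = 300.0, sweep_interval: float = 1.0):
        self._store: dict[str, tuple[str, float]] = {}  # token -> (original, expiry)
        self._lock = threading.Lock()
        self._ttl = ttl_seconds
        # Daemon thread sweeps expired entries so PII doesn't linger in memory.
        t = threading.Thread(target=self._sweep, args=(sweep_interval,), daemon=True)
        t.start()

    def put(self, token: str, original: str) -> None:
        with self._lock:
            self._store[token] = (original, time.monotonic() + self._ttl)

    def get(self, token: str):
        with self._lock:
            entry = self._store.get(token)
        # Treat past-TTL entries as gone even if the sweeper hasn't run yet.
        if entry is None or entry[1] < time.monotonic():
            return None
        return entry[0]

    def _sweep(self, interval: float) -> None:
        while True:
            time.sleep(interval)
            now = time.monotonic()
            with self._lock:
                for k in [k for k, (_, exp) in self._store.items() if exp < now]:
                    del self._store[k]
```

Checking the expiry inside `get` as well as in the sweeper means a lookup can never return stale PII, even in the window between sweeps.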
from Analyzers.AnalyzerType import AnalyzerType
from PIIScrubber import PIIScrubber

def main():
    from dotenv import load_dotenv
    load_dotenv("sample.env")

    pii = PIIScrubber(AnalyzerType.AZURE)
    text = "My phone number is 555-555-5555"
    print(pii.scrub([text]))
    print(pii.get_entities([text]))
    print(pii.anonymize(pii.get_entities([text]), text))
    print(pii.deanonymize(pii.anonymize(pii.get_entities([text]), text)))

if __name__ == "__main__":
    main()
Output:
['My phone number is ************']
[Text: 555-555-5555, Category: PhoneNumber, Subcategory: None, Offset: 19, Length: 12]
My phone number is :yudDNDuGJG:
My phone number is 555-555-5555
When using ChatGPT, anonymize the text before sending it to the LLM. When you receive the response, call deanonymize and the original text is restored.
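To see the mechanics of that round trip end to end, here is a self-contained sketch. The regex, the token format, and the helper names are illustrative stand-ins for what PIIScrubber does internally, not the project's actual code:

```python
import re
import secrets
import string

def _random_token(n: int = 10) -> str:
    # Token shaped like the :yudDNDuGJG: example above.
    return ":" + "".join(secrets.choice(string.ascii_letters) for _ in range(n)) + ":"

def anonymize(text: str, patterns: dict[str, str], cache: dict) -> str:
    """Replace each PII match with a random token and remember the mapping."""
    for pattern in patterns.values():
        for match in set(re.findall(pattern, text)):
            token = _random_token()
            cache[token] = match
            text = text.replace(match, token)
    return text

def deanonymize(text: str, cache: dict) -> str:
    """Restore the original values in the LLM's reply."""
    for token, original in cache.items():
        text = text.replace(token, original)
    return text

# Round trip: scrub before sending, restore after receiving.
cache: dict[str, str] = {}
prompt = anonymize("My phone number is 555-555-5555",
                   {"PhoneNumber": r"\d{3}-\d{3}-\d{4}"}, cache)
# prompt is now safe to send; only the scrubbed text leaves the machine.
reply = prompt  # stand-in for the LLM's response, which echoes the token back
restored = deanonymize(reply, cache)
```

The key point is that the token-to-original mapping never leaves your machine, so the LLM only ever sees the anonymized text.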
I hope you find this program useful. I welcome contributions of additional analyzers, anonymizers, cache providers, or any other functionality that makes it more useful for you.