+++
title = "LLM honeypot"
author = ["Michał Sapka"]
date = 2024-06-28T14:14:00+02:00
categories = ["blog"]
draft = false
weight = 2002
abstract = "The only way to fight that I see"
+++

Big tech doesn't care about people; the LLM industry actively seeks harm.
We're [seeing it time and time again](https://www.theverge.com/2024/6/27/24187405/perplexity-ai-twitter-lie-plagiarism).
They consider the open web to be a resource that exists only for them to harvest.

But the web was designed with good intentions in mind.
There is no way to actively _block_ them.

Copyright? Nope, fair use.

Robots.txt? Nope, some don't care, and others pretend to care after the theft.

Identifying them? Good luck.
Not only are their IPs in the _millions_, but they also lie in their user agents.

Some people try to poison LLMs with prompt injection, but that will not work against any larger dataset.

Personally, I want to at least try.
Therefore, my site contains a honeypot: _open [a git repository](https://michal.sapka.me/git/mms/Library-of-knowledge) and your IP will be logged_.
For now I only collect the addresses, but soon they will be blocked on my firewall for some time - a week, maybe?

This repo is:

- disallowed by robots.txt, so no well-behaved agent would harvest it
- labeled as a ban hammer in its description.

I'll wait for some time and publish the results.
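
For the curious, the collecting part could look roughly like the sketch below. It is not the script running on this site, just a minimal illustration under a few assumptions: the web server writes access logs in the common/combined log format, the honeypot is identified by the path of the repository linked above, and the ban lasts one week. The names `HONEYPOT_PREFIX`, `BAN_DURATION`, `collect_bans` and `active_bans` are made up for this example. Feeding the resulting addresses to a firewall is left out, since that depends entirely on the firewall in use.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumption: the honeypot is recognised by this path prefix.
HONEYPOT_PREFIX = "/git/mms/Library-of-knowledge"
# Assumption: bans last one week.
BAN_DURATION = timedelta(weeks=1)

# Start of a common/combined log format line:
# client IP, identity, user, [timestamp], "METHOD path ...
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+)'
)


def collect_bans(log_lines):
    """Return {ip: ban_expiry} for every client that requested the honeypot."""
    bans = {}
    for line in log_lines:
        match = LOG_LINE.match(line)
        if match is None or not match["path"].startswith(HONEYPOT_PREFIX):
            continue
        seen = datetime.strptime(match["time"], "%d/%b/%Y:%H:%M:%S %z")
        expiry = seen + BAN_DURATION
        # Keep the latest expiry if the same IP shows up more than once.
        if match["ip"] not in bans or expiry > bans[match["ip"]]:
            bans[match["ip"]] = expiry
    return bans


def active_bans(bans, now):
    """Return the sorted IPs whose ban has not yet expired at `now`."""
    return sorted(ip for ip, expiry in bans.items() if expiry > now)
```

Calling `collect_bans(open("access.log"))` and then `active_bans(bans, datetime.now(timezone.utc))` would give the list of addresses that should currently be blocked; `access.log` is a placeholder for wherever the server keeps its log.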