I'd like to learn from other installations how they manage the aggressive bots that ignore robots.txt and crawl-delay. IP blocking is not working, as the AI bots are using legitimate IPs as proxies.
We are currently exploring a CAPTCHA or a silent challenge on our WAF.
Related news, posts, etc.:
https://arstechnica.com/ai/2025/03/devs-say-ai-crawlers-dominate-traffic-forcing-blocks-on-entire-countries/
https://www.linkedin.com/pulse/bots-coming-our-metadata-rosalyn-metz-0kyve/
Thanks for also starting the thread on the mailing list about these bots.
@Oliver Bertuch I am very interested to hear what you're doing to combat botnets. I've found at least four startup companies selling bot-detection-evasion-as-a-service =(
Try https://anubis.techaro.lol or https://git.gammaspectra.live/git/go-away :smiley:
See Anubis in action at https://data.fz-juelich.de
And that's just from the last ~6 hours
yikes
These people are crazy. I'm thinking about using go-away's proxy mode to lure the failing clients into a Nepenthes tar pit. Let them burn CPU on nonsense and poison their training material.
I had to turn off the Opengraph Preview Passthrough in Anubis though. They just kept on hammering on the root collection. :shrug:
Meanwhile, I'm playing with a top secret AI tool from @Slava Tykhonov and contributing to the problem. :sweat_smile:
I might let those AI bots with proper user agents back in. They seem to behave. Will need to verify their origin IPs though.
In addition, I'm thinking about adding a cache layer to my NGINX Ingress so I can re-enable OpenGraph previews.
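For anyone curious what that could look like, here is a minimal proxy-cache sketch in plain NGINX config (hostnames and paths are placeholders; on the NGINX Ingress Controller the same directives would have to go into snippet annotations instead):

# http context: a small cache for anonymous page fetches (e.g. OpenGraph preview requests)
proxy_cache_path /var/cache/nginx/dataverse levels=1:2 keys_zone=dataverse_cache:10m max_size=1g inactive=10m;

server {
    listen 80;
    server_name data.example.org;                   # placeholder
    location / {
        proxy_pass http://dataverse-backend:8080;   # placeholder upstream
        proxy_cache dataverse_cache;
        proxy_cache_valid 200 5m;                   # keep successful responses for 5 minutes
        proxy_cache_bypass $http_cookie;            # never serve cached pages to sessions with cookies
        proxy_no_cache $http_cookie;
    }
}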
When you say hammering the root collection do you mean the homepage?
We don't have a static home page, we just present the root collection. So any visitor triggers a Solr search.
Solr seemed to be fine with that, but Dataverse couldn't keep up loading the required data from the results.
Gotcha. Hitting the homepage does call into the database as well.
@Oliver Bertuch Anubis doesn't cause any problems with callbacks?
Don Sizemore said:
Oliver Bertuch Anubis doesn't cause any problems with callbacks?
I have no idea! Didn't test that yet. How would you suggest testing it?
OAuth2 / DataCite publishing, anything that would call back.
Also, you're using Nginx and not Apache, correct?
Yes, I'm using NGINX (Ingress Controller). But Anubis and Go-Away are usable with anything, as they are just reverse proxies.
OAuth2 happens in the browser. As you solve the challenge once and then have a JWT in a cookie, that's no problemo
What kind of DataCite callbacks are you referring to? Registering a new DOI etc is just outgoing traffic, aye?
@Oliver Bertuch I'm glad to hear it's all working for you! My remaining question concerns our use of Shibboleth, and subsequent use of the AJP protocol through the Apache proxy. I'll look into this.
I'm using it like this: NGINX Ingress -> Anubis -> NGINX Router -> Dataverse Backend/Static Files/...
So you could just front your Apache with Anubis.
That obviously leaves the question of SSL termination
If you do that in Apache right now, you might need to split the setup a little
Or take a look at go-away; they have SSL support built in, including TLS fingerprinting
We don't have Shibboleth in the mix pending InCommon Federation publication, but Anubis seems to work well with Apache and Dataverse. Instead of specifying a DocumentRoot in the last listener, we gave it the standard ProxyPass:
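# backend listener that Anubis forwards cleared requests to; it simply proxies on to Dataverse (Payara) via AJP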
<VirtualHost *:3000>
ServerAdmin <e-mail>
ServerName <fqdn>
ErrorLog /var/log/httpd/anubis_dataverse6_error.log
CustomLog /var/log/httpd/anubis_dataverse6_access.log combined
ProxyPass / ajp://localhost:8009/
ProxyPassReverse / ajp://localhost:8009/
</VirtualHost>
@Oliver Bertuch William got Anubis working with Shibboleth! Our current ssl.conf includes:
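# (excerpt: the directives below sit inside the existing SSL <VirtualHost>; its opening tag is omitted here)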
# these headers are required for Anubis
RequestHeader set "X-Real-Ip" expr=%{REMOTE_ADDR}
RequestHeader set X-Forwarded-Proto "https"
ProxyPreserveHost On
ProxyRequests Off
ProxyVia Off
<Location /shib.xhtml>
    AuthType shibboleth
    ShibRequestSetting requireSession 1
    require valid-user
    ProxyPass ajp://localhost:8009/shib.xhtml
    ProxyPassReverse ajp://localhost:8009/shib.xhtml
</Location>
# port Anubis listens on
ProxyPass / http://localhost:8923/
ProxyPassReverse / http://localhost:8923/
</VirtualHost>
# actual website config
<VirtualHost *:3000>
ServerAdmin <e-mail>
ServerName <fqdn>
ErrorLog /var/log/httpd/anubis_dataverse6_error.log
CustomLog /var/log/httpd/anubis_dataverse6_access.log combined
ProxyPass / ajp://localhost:8009/
ProxyPassReverse / ajp://localhost:8009/
</VirtualHost>
Looks great! Congrats! May their connections be weighed
@Oliver Bertuch we're having trouble with Anubis' ALLOWs in bot_policies.json - are you willing to share yours?
Orly? What kind of problems? I'm using the standard config, but can also share it.
## Anubis has the ability to let you import snippets of configuration into the main
## configuration file. This allows you to break up your config into smaller parts
## that get logically assembled into one big file.
##
## Of note, a bot rule can either have inline bot configuration or import a
## bot config snippet. You cannot do both in a single bot rule.
##
## Import paths can either be prefixed with (data) to import from the common/shared
## rules in the data folder in the Anubis source tree or will point to absolute/relative
## paths in your filesystem. If you don't have access to the Anubis source tree, check
## /usr/share/docs/anubis/data or in the tarball you extracted Anubis from.
bots:
  # Pathological bots to deny
  - # This correlates to data/bots/ai-robots-txt.yaml in the source tree
    import: (data)/bots/ai-robots-txt.yaml
  - import: (data)/bots/cloudflare-workers.yaml
  - import: (data)/bots/headless-browsers.yaml
  - import: (data)/bots/us-ai-scraper.yaml
  # Search engines to allow
  - import: (data)/crawlers/googlebot.yaml
  - import: (data)/crawlers/bingbot.yaml
  - import: (data)/crawlers/duckduckbot.yaml
  - import: (data)/crawlers/qwantbot.yaml
  - import: (data)/crawlers/internet-archive.yaml
  - import: (data)/crawlers/kagibot.yaml
  - import: (data)/crawlers/marginalia.yaml
  - import: (data)/crawlers/mojeekbot.yaml
  # Allow common "keeping the internet working" routes (well-known, favicon, robots.txt)
  - import: (data)/common/keep-internet-working.yaml
  # # Punish any bot with "bot" in the user-agent string
  # # This is known to have a high false-positive rate, use at your own risk
  # - name: generic-bot-catchall
  #   user_agent_regex: (?i:bot|crawler)
  #   action: CHALLENGE
  #   challenge:
  #     difficulty: 16 # impossible
  #     report_as: 4 # lie to the operator
  #     algorithm: slow # intentionally waste CPU cycles and time
  # Generic catchall rule
  - name: generic-browser
    user_agent_regex: >-
      Mozilla|Opera
    action: CHALLENGE

dnsbl: false
@Oliver Bertuch so you haven't had to specifically allow, for example, ^/api.* ? (For us it was shib.xhtml: we had to tell Apache to bypass Anubis for that location. Shib wouldn't work even with shib.xhtml.* set to ALLOW.)
We're not using Shib, so your mileage may vary.
I didn't have to do any of that, because most clients that only hit the API don't use Mozilla or Opera user agents.
If someone uses their browser to access the API, it seems unlikely they go directly to it, but probably will go through the UI first.
Then they have the solved challenge cached and can send it along with the request
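If you did want an explicit bypass in the Anubis policy instead of relying on user agents, a rule roughly like this should do it (just a sketch; the path regex is an assumption to adapt, and rule order matters since the first match wins):

  # hypothetical: let API clients through without a challenge
  - name: api-allow
    path_regex: ^/api/.*
    action: ALLOW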
@Oliver Bertuch my grad student worked around that with:
<Location /shib.xhtml>
    AuthType shibboleth
    ShibRequestSetting requireSession 1
    require valid-user
    ProxyPass ajp://localhost:8009/shib.xhtml
    ProxyPassReverse ajp://localhost:8009/shib.xhtml
</Location>
though I think I see a bit of missing configuration which should fix our excludes problem.
@Oliver Bertuch I did find that Anubis broke one other thing: dataverse-previewers. We switched course with Anubis and told it only to challenge /dataverse/, /dataverse.xhtml, and dataset.xhtml (and kept the explicit AJP bypass in the shib.xhtml location block). FWIW, the test server is seeing a botnet matching Anubis' "aggressive Brazilian scrapers" criteria: https://github.com/TecharoHQ/anubis/blob/main/data/bots/aggressive-brazilian-scrapers.yaml I am very nearly ready to implement this on UNC Dataverse.
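For anyone following along, scoping the challenge like that might look roughly like this in the bot policy (a sketch only, assuming first-match rule ordering and Anubis' path_regex support; the exact regexes would need tuning for your installation):

bots:
  # hypothetical: only challenge the HTML-facing collection and dataset pages
  - name: challenge-ui-pages
    path_regex: ^/(dataverse/|dataverse\.xhtml|dataset\.xhtml)
    action: CHALLENGE
  # everything else (API, previewers, downloads) passes through untouched
  - name: allow-everything-else
    path_regex: .*
    action: ALLOW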
I don't have previewers yet, so I didn't notice that one! Good catch! Let the folks roam through the wheat of Duat freely. :relieved:
are you using direct upload/download? (that seems to work fine)
Nope, not yet. Will need to build an S3 proxy first.
Thanks a lot for this thread. We are facing the same problem and will try https://anubis.techaro.lol/.
If you need something brandable now, take a look at https://git.gammaspectra.live/git/go-away. Techaro wanted to make Anubis brandable but as a paid option only.
We found that disabling Solr facets works reasonably well: https://guides.dataverse.org/en/latest/installation/config.html#disablesolrfacetswithoutjsession (we did a manual adjustment since our version does not have this functionality). Thanks to Jim Myers for suggesting it.
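For reference, on versions that do ship the feature, it is toggled like any other database setting via the admin API (double-check the exact setting name and casing against the linked guide):

curl -X PUT -d true http://localhost:8080/api/admin/settings/:DisableSolrFacetsWithoutJsession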
Related: https://library.unc.edu/news/library-it-vs-the-ai-bots/
@Bikram you're going to talk about these bots in a future community call, right? Should we start hyping it? :smile:
Haha, surprisingly Borealis is getting less bot attention compared to the other services we host, but we are using the same preventive measures for Borealis since they share the same front-end proxy.
That's good.
@Bikram and do I have it right that you want 10-15 minutes during the next call, in September?
yes, Amber suggested doing so
Great, Julian is running that call and already created an agenda for it. @Julian Gautier I'm happy to go ahead and add a bullet about bots if you like.
Thanks all. Like I wrote in the email thread, I'd like to confirm with Jenna first before editing the agenda.
Thanks for confirming with Jenna and updating the agenda! Looks like we're on!
A critical remark on Anubis: https://lock.cmpxchg8b.com/anubis.html
We're starting the call that includes @Bikram on bots! https://docs.google.com/document/d/1daw0hHWOtd3PtSF-meM1BN6nK8JwoYhbTChUAiC-GwQ/edit?usp=sharing
Screenshot 2025-09-02 at 10.04.44 AM.png
@Bikram great talk and thanks for posting your slides!
thank you @Philip Durbin :)
Sure, here's the recording!
Thanks for sharing, @Bikram. Can you please share whether you have tested or implemented Cloudflare Turnstile on Borealis data downloads? :nerd:
Hi @Yuyun Wirawati, we have not implemented Cloudflare Turnstile yet because Borealis does not run on a single URL. We have another project, Odesi, which is integrated with Borealis, and a lot of schools access Borealis via EZproxy for Odesi data, so the URL changes for each school. The first challenge is adding all the URLs to the Turnstile allowlist; the second is that free Cloudflare only allows 15 URLs per widget. We were in talks with them to get a widget with a 100-URL limit, but the quotes were too high and did not work for us. We may end up moving Borealis behind the Cloudflare proxy, which is included for free.
@Don Sizemore @Leo Andreev you have Anubis in place, right? Is one of you still using normal upload instead of S3 Direct Upload? I'm having trouble with Anubis not forwarding error messages from the API when a file upload fails due to size restrictions.
@Oliver Bertuch We have Anubis in place but haven't run into this particular problem (yet)
Are you using Anubis in a reverse-proxy mode in between Apache and Dataverse? Also, which version are you running?
our particular config is written up here as a sample: https://github.com/IQSS/dataverse-security/wiki/Protecting-Dataverse-with-Anubis additions and corrections always welcome.
Oliver Bertuch said:
Are you using Anubis in a reverse-proxy mode in between Apache and Dataverse? Also, which version are you running?
In the case of Shibboleth, we provided an explicit Anubis bypass. I think Leonid did this for the API as well?
ProxyPass ajp://localhost:8009/shib.xhtml
ProxyPassReverse ajp://localhost:8009/shib.xhtml
Don Sizemore said:
our particular config is written up here as a sample: https://github.com/IQSS/dataverse-security/wiki/Protecting-Dataverse-with-Anubis additions and corrections always welcome.
I'm getting a 404 on that URL @Don Sizemore
@Michael Madsen correct, that's a private repo. you may request access via security@dataverse.org
I'm new to bot mitigation and find this thread useful!
Based on the conversations here, I see that most are opting for a third-party solution, which we will be investigating. Meanwhile, I've tried the simple rate limiting provided by setting :RateLimitingDefaultCapacityTiers. We've found that during intense bot activity we do reach the point where the 429 response kicks in.
I was expecting that only the IPs of the bots exceeding the set values would be blocked. Instead it appears all non-signed-in users were getting blocked once the count exceeded the settings, including myself via a browser.
Is this expected behavior? Is :RateLimitingDefaultCapacityTiers (and :RateLimitingCapacityByTierAndAction) not intended to deter bots? Are there additional settings other than "-X PUT -d '[x],[y]'" that need to be set? Thanks
@Frank Smutniak yes, I owe Leonid a documentation pull request here. IIRC not-logged-in users are tier 0 and subject to the rate limit, while logged-in users are tier 1 and not subject to any limit. The fix is to manually browse to /loginpage.xhtml and sign in. This was the "nuclear" option in bot protection.
@Steven Winship is the expert on rate limiting. He might have some more insight for us.
Tier 0 is the rate limit for all guest users. Logged-in users can be added to a tier, but there is no way to limit by IP address.
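For anyone trying this, the default tier capacities are set through the admin settings API, something like the following (the numbers are placeholders for the tier 0 and tier 1 capacities; see the rate-limiting section of the guides for the exact semantics):

# placeholder values: tier 0 (guests) gets 30, tier 1 (logged-in users) gets 120
curl -X PUT -d '30,120' http://localhost:8080/api/admin/settings/:RateLimitingDefaultCapacityTiers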
Thanks all. That corresponds with what we saw and we will investigate other options mentioned in this topic.