Aggressive crawlers / bots · community

Yuyun Wirawati (Apr 30 2025 at 01:36):

I'd like to learn from other installations how they manage aggressive bots that ignore robots.txt and crawl-delay. IP blocking is not working, as the AI bots are using legitimate IPs as proxies.
We are currently exploring a CAPTCHA or silent challenge on the WAF.

Philip Durbin 🚀 (Apr 30 2025 at 11:00):

Don Sizemore (Apr 30 2025 at 13:40):

@Oliver Bertuch I am very interested to hear what you're doing to combat botnets. I've found at least four startup companies selling bot-detection-evasion-as-a-service =(

Oliver Bertuch (Apr 30 2025 at 13:47):

Oliver Bertuch (Apr 30 2025 at 13:48):

Oliver Bertuch (Apr 30 2025 at 13:50):

Philip Durbin 🚀 (Apr 30 2025 at 13:50):

Oliver Bertuch (Apr 30 2025 at 13:51):

These people are crazy. I'm thinking about using go-away's proxy mode to lure the failing clients into a Nepenthes tar pit. So let them burn CPU on nonsense and poison their training materials.

Oliver Bertuch (Apr 30 2025 at 13:53):

I had to turn off the OpenGraph Preview Passthrough in Anubis though. They just kept hammering on the root collection. :shrug:

Philip Durbin 🚀 (Apr 30 2025 at 13:53):

Meanwhile, I'm playing with a top secret AI tool from @Slava Tykhonov and contributing to the problem. :sweat_smile:

Oliver Bertuch (Apr 30 2025 at 13:55):

I might let those AI bots with proper user agents back in. They seem to behave. Will need to verify their origin IPs though.
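Verifying origin IPs, as mentioned above, usually comes down to checking the client address against the CIDR ranges a bot operator publishes. A minimal Python sketch, assuming the ranges have already been fetched from the operator. The bot name and networks below are placeholders (address blocks reserved for documentation), and `is_verified_bot` is a hypothetical helper, not part of Anubis or go-away:

```python
import ipaddress

# Placeholder data: documentation-only address blocks standing in for an
# operator's published list. Real ranges must come from the operator itself.
PUBLISHED_RANGES = {
    "examplebot": ["192.0.2.0/24", "2001:db8::/32"],
}


def is_verified_bot(user_agent: str, remote_ip: str) -> bool:
    """Return True only if the UA names a known bot AND the IP is in its ranges."""
    ip = ipaddress.ip_address(remote_ip)
    ua = user_agent.lower()
    for bot, cidrs in PUBLISHED_RANGES.items():
        if bot in ua:
            # Membership tests across IPv4/IPv6 simply evaluate to False.
            return any(ip in ipaddress.ip_network(cidr) for cidr in cidrs)
    return False
```

A request claiming to be `ExampleBot/1.0` from `198.51.100.7` would fail this check, because the address lies outside the listed ranges even though the user agent looks legitimate.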

Oliver Bertuch (Apr 30 2025 at 13:56):

In addition, thinking about adding a cache layer to my NGINX Ingress so I can re-enable OpenGraph previews.

Philip Durbin 🚀 (Apr 30 2025 at 13:57):

Oliver Bertuch (Apr 30 2025 at 13:58):

We don't have a static home page, we just present the root collection. So any visitor triggers a Solr search.

Oliver Bertuch (Apr 30 2025 at 13:58):

Solr seemed to be fine with that, but the performance of Dataverse loading the required stuff from the result just wasn't there. It couldn't keep up.

Philip Durbin 🚀 (Apr 30 2025 at 14:03):

Don Sizemore (Apr 30 2025 at 14:13):

Oliver Bertuch (Apr 30 2025 at 14:56):

Don Sizemore (Apr 30 2025 at 15:30):

OAuth2 / DataCite publishing, anything that would call back.
Also, you're using Nginx and not Apache, correct?

Oliver Bertuch (Apr 30 2025 at 15:34):

Yes, I'm using NGINX (Ingress Controller). But Anubis and Go-Away are usable with anything, as they just are reverse proxies.

Oliver Bertuch (Apr 30 2025 at 15:35):

OAuth2 happens in the browser. As you solve the challenge once and then have a JWT in a cookie, that's no problemo
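To illustrate why a once-solved challenge does not get in the way of later browser redirects, here is a rough Python sketch of a signed, expiring token of the kind that could be stored in a cookie. It uses HS256 with a shared secret purely for simplicity. The token format and signing scheme Anubis actually uses are not described in this thread, and `issue_token` / `is_token_valid` are hypothetical names:

```python
import base64
import hashlib
import hmac
import json


def _b64url(data: bytes) -> str:
    # JWT-style base64url encoding without padding.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def issue_token(secret: bytes, expires_at: int) -> str:
    """Issue a token after the challenge was solved; expires_at is a Unix time."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps({"exp": expires_at}).encode())
    signing_input = f"{header}.{payload}".encode("ascii")
    signature = hmac.new(secret, signing_input, hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url(signature)}"


def is_token_valid(secret: bytes, token: str, now: int) -> bool:
    """Check signature and expiry; any valid cookie skips the challenge."""
    try:
        header, payload, signature = token.split(".")
    except ValueError:
        return False
    signing_input = f"{header}.{payload}".encode("ascii")
    expected = _b64url(hmac.new(secret, signing_input, hashlib.sha256).digest())
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        return False
    padded = payload + "=" * (-len(payload) % 4)
    claims = json.loads(base64.urlsafe_b64decode(padded))
    return now < claims["exp"]
```

As long as the browser sends the cookie back, requests that follow an OAuth2 redirect validate like any other request from that browser.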

Oliver Bertuch (Apr 30 2025 at 15:35):

What kind of DataCite callbacks are you referring to? Registering a new DOI etc is just outgoing traffic, aye?

Don Sizemore (Apr 30 2025 at 17:10):

@Oliver Bertuch I'm glad to hear it's all working for you! My remaining question concerns our use of Shibboleth, and subsequent use of the AJP protocol through the Apache proxy. I'll look into this.

Oliver Bertuch (Apr 30 2025 at 17:14):

I'm using it like this: NGINX Ingress -> Anubis -> NGINX Router -> Dataverse Backend/Static Files/...

Oliver Bertuch (Apr 30 2025 at 17:14):

Oliver Bertuch (Apr 30 2025 at 17:15):

Oliver Bertuch (Apr 30 2025 at 17:16):

Or take a look at go-away, they have SSL stuff builtin including TLS fingerprinting

Don Sizemore (May 08 2025 at 14:25):

We don't have Shibboleth in the mix pending InCommon Federation publication, but Anubis seems to work well with Apache and Dataverse. Instead of specifying a DocumentRoot in the last listener, we gave it the standard ProxyPass:

<VirtualHost *:3000>
  ServerAdmin <e-mail>
  ServerName <fqdn>
  ErrorLog /var/log/httpd/anubis_dataverse6_error.log
  CustomLog /var/log/httpd/anubis_dataverse6_access.log combined

  ProxyPass / ajp://localhost:8009/
  ProxyPassReverse / ajp://localhost:8009/
</VirtualHost>

Don Sizemore (May 13 2025 at 15:42):

@Oliver Bertuch William got Anubis working with Shibboleth! Our current ssl.conf includes:

  # these headers are required for Anubis
  RequestHeader set "X-Real-Ip" expr=%{REMOTE_ADDR}
  RequestHeader set X-Forwarded-Proto "https"
  ProxyPreserveHost On
  ProxyRequests Off
  ProxyVia Off

  <Location /shib.xhtml>
    AuthType shibboleth
    ShibRequestSetting requireSession 1
    require valid-user

    ProxyPass ajp://localhost:8009/shib.xhtml
    ProxyPassReverse ajp://localhost:8009/shib.xhtml
  </Location>

  # port Anubis listens on
  ProxyPass / http://localhost:8923/
  ProxyPassReverse / http://localhost:8923/
</VirtualHost>

# actual website config
<VirtualHost *:3000>
  ServerAdmin <e-mail>
  ServerName <fqdn>
  ErrorLog /var/log/httpd/anubis_dataverse6_error.log
  CustomLog /var/log/httpd/anubis_dataverse6_access.log combined

  ProxyPass / ajp://localhost:8009/
  ProxyPassReverse / ajp://localhost:8009/
</VirtualHost>

Oliver Bertuch (May 13 2025 at 16:07):

Don Sizemore (May 17 2025 at 11:15):

@Oliver Bertuch we're having trouble with Anubis' ALLOWs in bot_policies.json - are you willing to share yours?

Oliver Bertuch (May 17 2025 at 11:16):

Orly? What kind of problems? I'm using the standard config, but can also share it.

Oliver Bertuch (May 17 2025 at 11:19):

## Anubis has the ability to let you import snippets of configuration into the main
## configuration file. This allows you to break up your config into smaller parts
## that get logically assembled into one big file.
##
## Of note, a bot rule can either have inline bot configuration or import a
## bot config snippet. You cannot do both in a single bot rule.
##
## Import paths can either be prefixed with (data) to import from the common/shared
## rules in the data folder in the Anubis source tree or will point to absolute/relative
## paths in your filesystem. If you don't have access to the Anubis source tree, check
## /usr/share/docs/anubis/data or in the tarball you extracted Anubis from.

bots:
  # Pathological bots to deny
  - # This correlates to data/bots/ai-robots-txt.yaml in the source tree
    import: (data)/bots/ai-robots-txt.yaml
  - import: (data)/bots/cloudflare-workers.yaml
  - import: (data)/bots/headless-browsers.yaml
  - import: (data)/bots/us-ai-scraper.yaml

  # Search engines to allow
  - import: (data)/crawlers/googlebot.yaml
  - import: (data)/crawlers/bingbot.yaml
  - import: (data)/crawlers/duckduckbot.yaml
  - import: (data)/crawlers/qwantbot.yaml
  - import: (data)/crawlers/internet-archive.yaml
  - import: (data)/crawlers/kagibot.yaml
  - import: (data)/crawlers/marginalia.yaml
  - import: (data)/crawlers/mojeekbot.yaml

  # Allow common "keeping the internet working" routes (well-known, favicon, robots.txt)
  - import: (data)/common/keep-internet-working.yaml

  # # Punish any bot with "bot" in the user-agent string
  # # This is known to have a high false-positive rate, use at your own risk
  # - name: generic-bot-catchall
  #   user_agent_regex: (?i:bot|crawler)
  #   action: CHALLENGE
  #   challenge:
  #     difficulty: 16  # impossible
  #     report_as: 4    # lie to the operator
  #     algorithm: slow # intentionally waste CPU cycles and time

  # Generic catchall rule
  - name: generic-browser
    user_agent_regex: >-
      Mozilla|Opera
    action: CHALLENGE

dnsbl: false

Don Sizemore (May 17 2025 at 11:38):

@Oliver Bertuch so you haven't had to specifically allow for example ^/api.* ? (For us, it was shib.xhtml; we had to tell Apache to bypass Anubis for that location. Shib wouldn't work even with shib.xhtml.* set to ALLOW.)

Oliver Bertuch (May 17 2025 at 12:00):

Oliver Bertuch (May 17 2025 at 12:01):

I didn't have to do any of that, because most clients that only go to the API won't have Mozilla or Opera in their user agent.

Oliver Bertuch (May 17 2025 at 12:01):

If someone uses their browser to access the API, it seems unlikely they go directly to it, but probably will go through the UI first.

Oliver Bertuch (May 17 2025 at 12:02):

Then they have the solved challenge cached and can send it along with the request
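The reasoning above can be sketched in a few lines of Python. The pattern is the one from the generic-browser rule in the policy shared earlier; the sketch assumes the pattern is matched anywhere in the user agent string and that no earlier deny or allow rule already applied. `action_for` and the `"PASS"` label are made up for illustration:

```python
import re

# Same pattern as the generic-browser catch-all rule in the shared policy.
GENERIC_BROWSER = re.compile(r"Mozilla|Opera")


def action_for(user_agent: str) -> str:
    """Return "CHALLENGE" if the catch-all rule matches, else "PASS"."""
    return "CHALLENGE" if GENERIC_BROWSER.search(user_agent) else "PASS"


for ua in (
    "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0",
    "curl/8.5.0",
    "python-requests/2.32.3",
):
    print(f"{action_for(ua):9} {ua}")
```

Typical API clients such as curl or python-requests do not match the pattern, so under these assumptions they are never challenged, while anything that identifies as a browser is.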

Don Sizemore (May 19 2025 at 12:31):

  <Location /shib.xhtml>
    AuthType shibboleth
    ShibRequestSetting requireSession 1
    require valid-user

    ProxyPass ajp://localhost:8009/shib.xhtml
    ProxyPassReverse ajp://localhost:8009/shib.xhtml
  </Location>

Don Sizemore (May 19 2025 at 13:07):

though I think I see a bit of missing configuration which should fix our excludes problem.

Don Sizemore (May 19 2025 at 19:17):

@Oliver Bertuch I did find that Anubis broke one other thing: dataverse-previewers. We switched course with Anubis, and told it only to challenge /dataverse/, /dataverse.xhtml and dataset.xhtml (and kept the explicit AJP bypass in the shib.xhtml location block). FWIW the test server is seeing a botnet matching Anubis' "aggressive Brazilian scrapers" criteria: https://github.com/TecharoHQ/anubis/blob/main/data/bots/aggressive-brazilian-scrapers.yaml I am very nearly ready to implement this on UNC Dataverse.
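Scoping the challenge to a few UI paths could look roughly like the following Python sketch. The regular expression is a guess at how the three locations named above might be expressed, not the actual UNC policy, and `should_challenge` is a hypothetical helper:

```python
import re

# Hypothetical pattern covering the three locations named in the message.
CHALLENGE_PATHS = re.compile(r"^/(dataverse/|dataverse\.xhtml|dataset\.xhtml)")


def should_challenge(path: str) -> bool:
    """Challenge only the collection and dataset pages; let everything else through."""
    return CHALLENGE_PATHS.match(path) is not None
```

With this scoping, requests such as `/api/info/version` or `/shib.xhtml` are never challenged, which is the effect described for the API, the previewers, and Shibboleth.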

Oliver Bertuch (May 19 2025 at 19:20):

Don Sizemore (May 19 2025 at 19:50):

Oliver Bertuch (May 19 2025 at 19:51):

Dorothea Iglezakis (Jun 03 2025 at 11:53):

Oliver Bertuch (Jun 03 2025 at 11:55):

Yuyun Wirawati (Jun 04 2025 at 01:41):

Philip Durbin 🚀 (Jun 16 2025 at 15:21):

Philip Durbin 🚀 (Aug 12 2025 at 13:29):

@Bikram you're going to talk about these bots in a future community call, right? Should we start hyping it? :smile:

Bikram (Aug 12 2025 at 14:54):

Haha, surprisingly Borealis is getting less bot attention than the other services we host, but we are using the same preventive measures for Borealis, as they share the same front-end proxy.

Philip Durbin 🚀 (Aug 12 2025 at 14:55):

Philip Durbin 🚀 (Aug 12 2025 at 14:57):

@Bikram and do I have it right that you want 10-15 minutes during the next call, in September?

Bikram (Aug 12 2025 at 15:16):

Philip Durbin 🚀 (Aug 12 2025 at 15:18):

Great, Julian is running that call and already created an agenda for it. @Julian Gautier I'm happy to go ahead and add a bullet about bots if you like.

Julian Gautier (Aug 12 2025 at 16:08):

Thanks all. Like I wrote in the email thread, I'd like to confirm with Jenna first before editing the agenda.

Philip Durbin 🚀 (Aug 13 2025 at 19:33):

Yuyun Wirawati (Aug 22 2025 at 05:39):

Philip Durbin 🚀 (Sep 02 2025 at 14:02):

Philip Durbin 🚀 (Sep 02 2025 at 14:05):

Philip Durbin 🚀 (Sep 02 2025 at 14:33):

Bikram (Sep 02 2025 at 15:46):

Philip Durbin 🚀 (Sep 02 2025 at 16:04):

Yuyun Wirawati (Sep 04 2025 at 06:57):

Thanks for sharing, @Bikram. Can you please share with us whether you have tested / implemented Cloudflare Turnstile on Borealis data downloads? :nerd:

Bikram (Sep 04 2025 at 12:23):

Hi @Yuyun Wirawati, we have not implemented Cloudflare Turnstile yet, because Borealis does not run on a single URL. We have another project, Odesi, which is integrated with Borealis, and for Odesi data a lot of schools access Borealis via EZproxy, so the URL changes for each school. The first challenge is adding all the URLs to the Turnstile allowlist. The second is that free Cloudflare only allows 15 URLs per widget; we were in talks with them to get a widget with a 100-URL limit, but the quotes were too high and did not work for us. We may end up moving Borealis behind Cloudflare Proxy, which is included for free.

Oliver Bertuch (Oct 01 2025 at 12:29):

@Don Sizemore @Leo Andreev you have Anubis in place, right? Is one of you still using normal upload instead of S3 Direct Upload? I'm having trouble with Anubis not forwarding error messages from the API when a file upload fails due to size restrictions.

Don Sizemore (Oct 01 2025 at 12:43):

@Oliver Bertuch We have Anubis in place but haven't run into this particular problem (yet)

Oliver Bertuch (Oct 01 2025 at 12:47):

Are you using Anubis in a reverse-proxy mode in between Apache and Dataverse? Also, which version are you running?

Don Sizemore (Oct 01 2025 at 17:08):

Don Sizemore (Oct 03 2025 at 18:49):

In the case of Shibboleth, we provided an explicit Anubis bypass. I think Leonid did this for the API as well?

      ProxyPass ajp://localhost:8009/shib.xhtml
      ProxyPassReverse ajp://localhost:8009/shib.xhtml

Michael Madsen (Oct 06 2025 at 09:28):

Don Sizemore (Oct 06 2025 at 12:44):

@Michael Madsen correct, that's a private repo. you may request access via security@dataverse.org

Frank Smutniak (Oct 21 2025 at 19:33):

Based on the conversations here I see that most are opting for a 3rd party solution - which we will be investigating. Meanwhile I've tried simple rate limiting as provided by setting :RateLimitingDefaultCapacityTiers. We've found that during intense bot activity we will get to the point where the 429 response kicks in.

I was expecting that just the IPs of the bots exceeding the set values would be blocked. Instead, it appears all non-signed-in users were getting blocked once the count exceeded the settings, including myself via a browser.

Is this expected behavior? Is :RateLimitingDefaultCapacityTiers (and :RateLimitingCapacityByTierAndAction) not intended to deter bots? Are there additional settings other than "-X PUT -d '[x],[y]'" that need to be set? Thanks

Don Sizemore (Oct 21 2025 at 19:57):

@Frank Smutniak yes, I owe Leonid a documentation pull request here. IIRC not-logged-in users are tier 0 and subject to the rate limit, while logged-in users are tier 1 and not subject to any limit. The fix is to manually browse to /loginpage.xhtml and sign in. This was the "nuclear" option in bot protection.

Philip Durbin 🚀 (Oct 21 2025 at 20:01):

@Steven Winship is the expert on rate limiting. He might have some more insight for us.

Steven Winship (Oct 21 2025 at 20:12):

Tier0 is the rate limit for all Guest users. Logged in users can be added to a tier but there is no way to limit by IP address.
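A toy Python model may help show why one busy bot locks out every anonymous visitor. It is a simplified illustration of the behavior described above (one shared counter for all guests, no per-IP tracking, no window reset), not Dataverse's actual implementation. `HourlyLimiter`, `GUEST`, and the capacity values are invented for the example:

```python
from collections import defaultdict

GUEST = ":guest"  # assumption: every anonymous request maps to one shared key


class HourlyLimiter:
    """Counts requests per user key within one window; capacity depends on tier."""

    def __init__(self, capacity_by_tier: dict[int, int]) -> None:
        self.capacity_by_tier = capacity_by_tier
        self.used: defaultdict[str, int] = defaultdict(int)

    def allow(self, user_key: str, tier: int) -> bool:
        capacity = self.capacity_by_tier.get(tier)
        if capacity is None:  # no limit configured for this tier
            return True
        if self.used[user_key] >= capacity:
            return False  # the server would answer with HTTP 429
        self.used[user_key] += 1
        return True


limiter = HourlyLimiter({0: 3})          # tier 0 (guests): 3 requests per window
for _ in range(3):
    limiter.allow(GUEST, 0)              # a bot uses up the whole guest budget
human_guest_ok = limiter.allow(GUEST, 0)     # False: same shared counter
logged_in_ok = limiter.allow("@frank", 1)    # True: tier 1 has no limit here
```

Because the counter is keyed by user rather than by client address, the human guest is rejected as soon as the bot has spent the budget, while a signed-in account in an unlimited tier still gets through.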

Frank Smutniak (Oct 21 2025 at 21:36):

Thanks all. That corresponds with what we saw and we will investigate other options mentioned in this topic.
