Stream: community

Topic: Aggressive crawlers / bots


view this post on Zulip Yuyun Wirawati (Apr 30 2025 at 01:36):

I'd like to learn how other installations manage aggressive bots that ignore robots.txt and crawl-delay. IP blocking is not working, as the AI bots are using legitimate IPs as proxies.
We are currently exploring a CAPTCHA or a silent challenge at the WAF.

related news, posts, etc:
https://arstechnica.com/ai/2025/03/devs-say-ai-crawlers-dominate-traffic-forcing-blocks-on-entire-countries/
https://www.linkedin.com/pulse/bots-coming-our-metadata-rosalyn-metz-0kyve/

view this post on Zulip Philip Durbin 🚀 (Apr 30 2025 at 11:00):

Thanks for also starting the thread on the mailing list about these bots.

view this post on Zulip Don Sizemore (Apr 30 2025 at 13:40):

@Oliver Bertuch I am very interested to hear what you're doing to combat botnets. I've found at least four startup companies selling bot-detection-evasion-as-a-service =(

view this post on Zulip Oliver Bertuch (Apr 30 2025 at 13:47):

Try https://anubis.techaro.lol or https://git.gammaspectra.live/git/go-away :smiley:

view this post on Zulip Oliver Bertuch (Apr 30 2025 at 13:48):

See Anubis in action at https://data.fz-juelich.de

view this post on Zulip Oliver Bertuch (Apr 30 2025 at 13:50):

image.png

And that's just from the last ~6 hours

view this post on Zulip Philip Durbin 🚀 (Apr 30 2025 at 13:50):

yikes

view this post on Zulip Oliver Bertuch (Apr 30 2025 at 13:51):

These people are crazy. I'm thinking about using go-away's proxy mode to lure the failing clients into a Nepenthes tar pit. So let them burn CPU on nonsense and poison their training materials.

view this post on Zulip Oliver Bertuch (Apr 30 2025 at 13:53):

I had to turn off the OpenGraph Preview Passthrough in Anubis though. They just kept on hammering the root collection. :shrug:

view this post on Zulip Philip Durbin 🚀 (Apr 30 2025 at 13:53):

Meanwhile, I'm playing with a top secret AI tool from @Slava Tykhonov and contributing to the problem. :sweat_smile:

view this post on Zulip Oliver Bertuch (Apr 30 2025 at 13:55):

I might let those AI bots with proper user agents back in. They seem to behave. Will need to verify their origin IPs though.
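
One way to picture the "verify their origin IPs" step is to check a client address against the ranges a crawler operator publishes. This is a minimal sketch, assuming such a list is available; `PUBLISHED_RANGES` and `is_from_published_range` are hypothetical names, and the networks shown are reserved documentation ranges, not any real operator's addresses.

```python
import ipaddress

# Hypothetical published ranges for a well-behaved crawler. These are
# documentation-only networks (RFC 5737 / RFC 3849), used as placeholders.
PUBLISHED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_from_published_range(client_ip: str) -> bool:
    """Return True if client_ip parses and falls inside a published range."""
    try:
        addr = ipaddress.ip_address(client_ip)
    except ValueError:
        # Anything that is not a valid IPv4/IPv6 address is rejected.
        return False
    # Membership tests between mismatched IP versions evaluate to False.
    return any(addr in net for net in PUBLISHED_RANGES)
```

A request that claims a crawler's user agent but fails this check would then be treated like any other unverified client.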

view this post on Zulip Oliver Bertuch (Apr 30 2025 at 13:56):

In addition, thinking about adding a cache layer to my NGINX Ingress so I can re-enable OpenGraph previews.

view this post on Zulip Philip Durbin 🚀 (Apr 30 2025 at 13:57):

When you say hammering the root collection do you mean the homepage?

view this post on Zulip Oliver Bertuch (Apr 30 2025 at 13:58):

We don't have a static home page, we just present the root collection. So any visitor triggers a Solr search.

view this post on Zulip Oliver Bertuch (Apr 30 2025 at 13:58):

Solr seemed to be fine with that, but the performance of Dataverse loading the required stuff from the result just wasn't there. It couldn't keep up.

view this post on Zulip Philip Durbin 🚀 (Apr 30 2025 at 14:03):

Gotcha. Hitting the homepage does call into the database as well.

view this post on Zulip Don Sizemore (Apr 30 2025 at 14:13):

@Oliver Bertuch Anubis doesn't cause any problems with callbacks?

view this post on Zulip Oliver Bertuch (Apr 30 2025 at 14:56):

Don Sizemore said:

Oliver Bertuch Anubis doesn't cause any problems with callbacks?

I have no idea! Didn't test that yet. How would you suggest testing it?

view this post on Zulip Don Sizemore (Apr 30 2025 at 15:30):

OAuth2 / DataCite publishing, anything that would call back.
Also, you're using Nginx and not Apache, correct?

view this post on Zulip Oliver Bertuch (Apr 30 2025 at 15:34):

Yes, I'm using NGINX (Ingress Controller). But Anubis and Go-Away are usable with anything, as they just are reverse proxies.

view this post on Zulip Oliver Bertuch (Apr 30 2025 at 15:35):

OAuth2 happens in the browser. As you solve the challenge once and then have a JWT in a cookie, that's no problemo
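
The "solve the challenge once, then carry a cookie" flow can be pictured with a generic proof-of-work puzzle. This is a minimal sketch, assuming a SHA-256 puzzle whose difficulty is a count of leading zero hex digits; `solve` and `verify` are hypothetical names, and this is not Anubis's actual code or cookie format.

```python
import hashlib
from itertools import count


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Cheap server-side check: does the digest start with enough zeros?"""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)


def solve(challenge: str, difficulty: int) -> int:
    """Expensive client-side search: try nonces until one verifies."""
    for nonce in count():
        if verify(challenge, nonce, difficulty):
            return nonce
    raise AssertionError("unreachable: count() never ends")
```

The asymmetry is the point: the client pays for the search once, the server verifies with a single hash, and a signed token stored in a cookie lets later requests (such as the browser-side OAuth2 redirects) skip the puzzle.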

view this post on Zulip Oliver Bertuch (Apr 30 2025 at 15:35):

What kind of DataCite callbacks are you referring to? Registering a new DOI etc is just outgoing traffic, aye?

view this post on Zulip Don Sizemore (Apr 30 2025 at 17:10):

@Oliver Bertuch I'm glad to hear it's all working for you! My remaining question concerns our use of Shibboleth, and subsequent use of the AJP protocol through the Apache proxy. I'll look into this.

view this post on Zulip Oliver Bertuch (Apr 30 2025 at 17:14):

I'm using it like this: NGINX Ingress -> Anubis -> NGINX Router -> Dataverse Backend/Static Files/...

view this post on Zulip Oliver Bertuch (Apr 30 2025 at 17:14):

So you could just front your Apache with Anubis.

view this post on Zulip Oliver Bertuch (Apr 30 2025 at 17:14):

That obviously leaves the question of SSL termination

view this post on Zulip Oliver Bertuch (Apr 30 2025 at 17:15):

If you do that in Apache right now, you might need to split the setup a little

view this post on Zulip Oliver Bertuch (Apr 30 2025 at 17:16):

Or take a look at go-away; it has SSL support built in, including TLS fingerprinting.

view this post on Zulip Don Sizemore (May 08 2025 at 14:25):

We don't have Shibboleth in the mix pending InCommon Federation publication, but Anubis seems to work well with Apache and Dataverse. Instead of specifying a DocumentRoot in the last listener, we gave it the standard ProxyPass:

<VirtualHost *:3000>
  ServerAdmin <e-mail>
  ServerName <fqdn>
  ErrorLog /var/log/httpd/anubis_dataverse6_error.log
  CustomLog /var/log/httpd/anubis_dataverse6_access.log combined

  ProxyPass / ajp://localhost:8009/
  ProxyPassReverse / ajp://localhost:8009/
</VirtualHost>

view this post on Zulip Don Sizemore (May 13 2025 at 15:42):

@Oliver Bertuch William got Anubis working with Shibboleth! Our current ssl.conf includes:

  # these headers are required for Anubis
  RequestHeader set "X-Real-Ip" expr=%{REMOTE_ADDR}
  RequestHeader set X-Forwarded-Proto "https"
  ProxyPreserveHost On
  ProxyRequests Off
  ProxyVia Off

  <Location /shib.xhtml>
    AuthType shibboleth
    ShibRequestSetting requireSession 1
    require valid-user

    ProxyPass ajp://localhost:8009/shib.xhtml
    ProxyPassReverse ajp://localhost:8009/shib.xhtml
  </Location>

  # port Anubis listens on
  ProxyPass / http://localhost:8923/
  ProxyPassReverse / http://localhost:8923/
</VirtualHost>

# actual website config
<VirtualHost *:3000>
  ServerAdmin <e-mail>
  ServerName <fqdn>
  ErrorLog /var/log/httpd/anubis_dataverse6_error.log
  CustomLog /var/log/httpd/anubis_dataverse6_access.log combined

  ProxyPass / ajp://localhost:8009/
  ProxyPassReverse / ajp://localhost:8009/
</VirtualHost>

view this post on Zulip Oliver Bertuch (May 13 2025 at 16:07):

Looks great! Congrats! May their connections be weighed

view this post on Zulip Don Sizemore (May 17 2025 at 11:15):

@Oliver Bertuch we're having trouble with Anubis' ALLOWs in bot_policies.json - are you willing to share yours?

view this post on Zulip Oliver Bertuch (May 17 2025 at 11:16):

Orly? What kind of problems? I'm using the standard config, but can also share it.

view this post on Zulip Oliver Bertuch (May 17 2025 at 11:19):

## Anubis has the ability to let you import snippets of configuration into the main
## configuration file. This allows you to break up your config into smaller parts
## that get logically assembled into one big file.
##
## Of note, a bot rule can either have inline bot configuration or import a
## bot config snippet. You cannot do both in a single bot rule.
##
## Import paths can either be prefixed with (data) to import from the common/shared
## rules in the data folder in the Anubis source tree or will point to absolute/relative
## paths in your filesystem. If you don't have access to the Anubis source tree, check
## /usr/share/docs/anubis/data or in the tarball you extracted Anubis from.

bots:
  # Pathological bots to deny
  - # This correlates to data/bots/ai-robots-txt.yaml in the source tree
    import: (data)/bots/ai-robots-txt.yaml
  - import: (data)/bots/cloudflare-workers.yaml
  - import: (data)/bots/headless-browsers.yaml
  - import: (data)/bots/us-ai-scraper.yaml

  # Search engines to allow
  - import: (data)/crawlers/googlebot.yaml
  - import: (data)/crawlers/bingbot.yaml
  - import: (data)/crawlers/duckduckbot.yaml
  - import: (data)/crawlers/qwantbot.yaml
  - import: (data)/crawlers/internet-archive.yaml
  - import: (data)/crawlers/kagibot.yaml
  - import: (data)/crawlers/marginalia.yaml
  - import: (data)/crawlers/mojeekbot.yaml

  # Allow common "keeping the internet working" routes (well-known, favicon, robots.txt)
  - import: (data)/common/keep-internet-working.yaml

  # # Punish any bot with "bot" in the user-agent string
  # # This is known to have a high false-positive rate, use at your own risk
  # - name: generic-bot-catchall
  #   user_agent_regex: (?i:bot|crawler)
  #   action: CHALLENGE
  #   challenge:
  #     difficulty: 16  # impossible
  #     report_as: 4    # lie to the operator
  #     algorithm: slow # intentionally waste CPU cycles and time

  # Generic catchall rule
  - name: generic-browser
    user_agent_regex: >-
      Mozilla|Opera
    action: CHALLENGE

dnsbl: false

view this post on Zulip Don Sizemore (May 17 2025 at 11:38):

@Oliver Bertuch so you haven't had to specifically allow, for example, ^/api.*? (For us it was shib.xhtml: we had to tell Apache to bypass Anubis for that location. Shib wouldn't work even with shib.xhtml.* set to ALLOW.)

view this post on Zulip Oliver Bertuch (May 17 2025 at 12:00):

We're not using Shib, so your mileage may vary.

view this post on Zulip Oliver Bertuch (May 17 2025 at 12:01):

I didn't have to do any of that, because most clients that only talk to the API don't have Mozilla or Opera in their user agent.

view this post on Zulip Oliver Bertuch (May 17 2025 at 12:01):

If someone uses their browser to access the API, it seems unlikely they go directly to it, but probably will go through the UI first.

view this post on Zulip Oliver Bertuch (May 17 2025 at 12:02):

Then they have the solved challenge cached and can send it along with the request
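
The reasoning above can be sketched as a user-agent check. This is a minimal illustration, assuming requests that match no rule are passed through; `action_for` and the `"PASS"` label are hypothetical, and only the `Mozilla|Opera` pattern comes from the shared config.

```python
import re

# Pattern taken from the generic-browser rule in the config shared above.
GENERIC_BROWSER = re.compile(r"Mozilla|Opera")


def action_for(user_agent: str) -> str:
    """Return the action the catch-all rule would pick for this user agent."""
    # Browsers (and bots posing as browsers) match and get challenged.
    if GENERIC_BROWSER.search(user_agent):
        return "CHALLENGE"
    # Assumption: anything not matched by a rule is let through unchallenged.
    return "PASS"
```

Under that assumption, typical API clients such as `curl/8.5.0` or `python-requests/2.32.0` never see a challenge, while anything sending a browser-style `Mozilla/5.0 ...` string does.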

view this post on Zulip Don Sizemore (May 19 2025 at 12:31):

@Oliver Bertuch my grad student worked around that with:

  <Location /shib.xhtml>
    AuthType shibboleth
    ShibRequestSetting requireSession 1
    require valid-user

    ProxyPass ajp://localhost:8009/shib.xhtml
    ProxyPassReverse ajp://localhost:8009/shib.xhtml
  </Location>

view this post on Zulip Don Sizemore (May 19 2025 at 13:07):

though I think I see a bit of missing configuration which should fix our excludes problem.

view this post on Zulip Don Sizemore (May 19 2025 at 19:17):

@Oliver Bertuch I did find that Anubis broke one other thing: dataverse-previewers. We switched course with Anubis and told it to challenge only /dataverse/, /dataverse.xhtml and dataset.xhtml (and kept the explicit AJP bypass in the shib.xhtml location block). FWIW, the test server is seeing a botnet matching Anubis' "aggressive Brazilian scrapers" criteria: https://github.com/TecharoHQ/anubis/blob/main/data/bots/aggressive-brazilian-scrapers.yaml I am very nearly ready to implement this on UNC Dataverse.

view this post on Zulip Oliver Bertuch (May 19 2025 at 19:20):

I don't have previewers yet, so I didn't notice that one! Good catch! Let the folks roam through the wheat of Duat freely. :relieved:

view this post on Zulip Don Sizemore (May 19 2025 at 19:50):

are you using direct upload/download? (that seems to work fine)

view this post on Zulip Oliver Bertuch (May 19 2025 at 19:51):

Nope, not yet. Will need to build an S3 proxy first.

view this post on Zulip Dorothea Iglezakis (Jun 03 2025 at 11:53):

Thanks a lot for this thread. We are facing the same problem and will try https://anubis.techaro.lol/.

view this post on Zulip Oliver Bertuch (Jun 03 2025 at 11:55):

If you need something brandable now, take a look at https://git.gammaspectra.live/git/go-away. Techaro wanted to make Anubis brandable but as a paid option only.

view this post on Zulip Yuyun Wirawati (Jun 04 2025 at 01:41):

We found that disabling Solr facets works reasonably well: https://guides.dataverse.org/en/latest/installation/config.html#disablesolrfacetswithoutjsession (we made a manual adjustment, since our version does not have this functionality). Thanks to Jim Myers for the suggestion.

view this post on Zulip Philip Durbin 🚀 (Jun 16 2025 at 15:21):

Related: https://library.unc.edu/news/library-it-vs-the-ai-bots/

view this post on Zulip Philip Durbin 🚀 (Aug 12 2025 at 13:29):

@Bikram you're going to talk about these bots in a future community call, right? Should we start hyping it? :smile:

view this post on Zulip Bikram (Aug 12 2025 at 14:54):

Haha, surprisingly Borealis is getting less bot attention than the other services we host, but we use the same preventive measures for Borealis, as they share the same front-end proxy.

view this post on Zulip Philip Durbin 🚀 (Aug 12 2025 at 14:55):

That's good.

view this post on Zulip Philip Durbin 🚀 (Aug 12 2025 at 14:57):

@Bikram and do I have it right that you want 10-15 minutes during the next call, in September?

view this post on Zulip Bikram (Aug 12 2025 at 15:16):

yes, Amber suggested doing so

view this post on Zulip Philip Durbin 🚀 (Aug 12 2025 at 15:18):

Great, Julian is running that call and already created an agenda for it. @Julian Gautier I'm happy to go ahead and add a bullet about bots if you like.

view this post on Zulip Julian Gautier (Aug 12 2025 at 16:08):

Thanks all. Like I wrote in the email thread, I'd like to confirm with Jenna first before editing the agenda.

view this post on Zulip Philip Durbin 🚀 (Aug 13 2025 at 19:33):

Thanks for confirming with Jenna and updating the agenda! Looks like we're on!

view this post on Zulip Yuyun Wirawati (Aug 22 2025 at 05:39):

A critical remark on Anubis: https://lock.cmpxchg8b.com/anubis.html

view this post on Zulip Philip Durbin 🚀 (Sep 02 2025 at 14:02):

We're starting the call that includes @Bikram on bots! https://docs.google.com/document/d/1daw0hHWOtd3PtSF-meM1BN6nK8JwoYhbTChUAiC-GwQ/edit?usp=sharing

view this post on Zulip Philip Durbin 🚀 (Sep 02 2025 at 14:05):

Screenshot 2025-09-02 at 10.04.44 AM.png

view this post on Zulip Philip Durbin 🚀 (Sep 02 2025 at 14:33):

@Bikram great talk and thanks for posting your slides! :dataverse_man:

view this post on Zulip Bikram (Sep 02 2025 at 15:46):

thank you @Philip Durbin 🚀 :)

view this post on Zulip Philip Durbin 🚀 (Sep 02 2025 at 16:04):

Sure, here's the recording!

view this post on Zulip Yuyun Wirawati (Sep 04 2025 at 06:57):

Thanks for sharing, @Bikram. Could you please tell us whether you have tested or implemented Cloudflare Turnstile on Borealis data downloads? :nerd:

view this post on Zulip Bikram (Sep 04 2025 at 12:23):

Hi @Yuyun Wirawati, we have not implemented Cloudflare Turnstile yet, because Borealis does not run on a single URL. We have another project, Odesi, which is integrated with Borealis, and for Odesi data a lot of schools access Borealis via EZproxy, so the URL changes for each school. The first challenge is adding all the URLs to the Turnstile allowlist. The second is that free Cloudflare only allows 15 URLs per widget; we were in talks with them to get a widget with a 100-URL limit, but the quotes were too high and did not work for us. We may end up moving Borealis behind the Cloudflare proxy, which is included for free.

view this post on Zulip Oliver Bertuch (Oct 01 2025 at 12:29):

@Don Sizemore @Leo Andreev you have Anubis in place, right? Is one of you still using normal upload instead of S3 Direct Upload? I'm having trouble with Anubis not forwarding error messages from the API when a file upload fails due to size restrictions.

view this post on Zulip Don Sizemore (Oct 01 2025 at 12:43):

@Oliver Bertuch We have Anubis in place but haven't run into this particular problem (yet)

view this post on Zulip Oliver Bertuch (Oct 01 2025 at 12:47):

Are you using Anubis in a reverse-proxy mode in between Apache and Dataverse? Also, which version are you running?

view this post on Zulip Don Sizemore (Oct 01 2025 at 17:08):

our particular config is written up here as a sample: https://github.com/IQSS/dataverse-security/wiki/Protecting-Dataverse-with-Anubis additions and corrections always welcome.

view this post on Zulip Don Sizemore (Oct 03 2025 at 18:49):

Oliver Bertuch said:

Are you using Anubis in a reverse-proxy mode in between Apache and Dataverse? Also, which version are you running?

In the case of Shibboleth, we provided an explicit Anubis bypass. I think Leonid did this for the API as well?

      ProxyPass ajp://localhost:8009/shib.xhtml
      ProxyPassReverse ajp://localhost:8009/shib.xhtml

view this post on Zulip Michael Madsen (Oct 06 2025 at 09:28):

Don Sizemore said:

our particular config is written up here as a sample: https://github.com/IQSS/dataverse-security/wiki/Protecting-Dataverse-with-Anubis additions and corrections always welcome.

I'm getting a 404 on that url @Don Sizemore

view this post on Zulip Don Sizemore (Oct 06 2025 at 12:44):

@Michael Madsen correct, that's a private repo. you may request access via security@dataverse.org

view this post on Zulip Frank Smutniak (Oct 21 2025 at 19:33):

I'm new to bot mitigation and find this thread useful!

Based on the conversations here I see that most are opting for a 3rd party solution - which we will be investigating. Meanwhile I've tried simple rate limiting as provided by setting :RateLimitingDefaultCapacityTiers. We've found that during intense bot activity we will get to the point where the 429 response kicks in.

I was expecting that only the IPs of the bots exceeding the set values would be blocked. Instead, it appears all non-signed-in users were getting blocked once the count exceeded the settings - including myself via a browser.

Is this expected behavior? Is :RateLimitingDefaultCapacityTiers (and :RateLimitingCapacityByTierAndAction) not intended to deter bots? Are there additional settings other than "-X PUT -d '[x],[y]'" that need to be set? Thanks

view this post on Zulip Don Sizemore (Oct 21 2025 at 19:57):

@Frank Smutniak yes, I owe Leonid a documentation pull request here. IIRC not-logged-in users are tier 0 and subject to the rate limit, while logged-in users are tier 1 and not subject to any limit. The fix is to manually browse to /loginpage.xhtml and sign in. This was the "nuclear" option in bot protection.

view this post on Zulip Philip Durbin 🚀 (Oct 21 2025 at 20:01):

@Steven Winship is the expert on rate limiting. He might have some more insight for us.

view this post on Zulip Steven Winship (Oct 21 2025 at 20:12):

Tier0 is the rate limit for all Guest users. Logged in users can be added to a tier but there is no way to limit by IP address.
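
The behaviour Frank observed follows from keying the limit on the user rather than on the IP address. This is a toy model only, assuming all anonymous visitors share a single "guest" identity; `RateLimiter` and `status_for` are hypothetical names and this is not Dataverse's implementation.

```python
from collections import defaultdict


class RateLimiter:
    """Toy fixed-capacity limiter keyed by user identity, not by IP address."""

    def __init__(self, capacity_per_key: int) -> None:
        self.capacity = capacity_per_key
        self.used = defaultdict(int)

    def status_for(self, user_key: str) -> int:
        """Return the HTTP status this request would get: 200 or 429."""
        if self.used[user_key] >= self.capacity:
            return 429
        self.used[user_key] += 1
        return 200


limiter = RateLimiter(capacity_per_key=3)

# A bot and a human visitor are both anonymous, so both map to "guest".
bot_statuses = [limiter.status_for("guest") for _ in range(3)]
human_status = limiter.status_for("guest")

# A signed-in user has a separate identity and is unaffected by guest traffic.
logged_in_status = limiter.status_for("alice")
```

Once the bot has used up the shared guest capacity, the human visitor's first request is rejected as well, while the signed-in user still gets through, which matches the "sign in to get around it" workaround described above.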

view this post on Zulip Frank Smutniak (Oct 21 2025 at 21:36):

Thanks all. That corresponds with what we saw and we will investigate other options mentioned in this topic.


Last updated: Nov 01 2025 at 14:11 UTC