Stream: dev

Topic: metadata update


view this post on Zulip Philipp Conzett (Sep 20 2023 at 11:31):

I've created some API scripts to update the metadata of multiple datasets. I'm now testing the scripts in our test environment, which currently is a copy of production from yesterday, but we switched the DOIs from production DOIs (10.18710) to our test DOIs (10.21337). Here's what I've been doing:

  1. Download JSON with metadata from test.
  2. Run the script on all these JSON files. Looks good.
  3. Upload the updated JSON files to test.

But the files wouldn't upload. Is this because the metadata still refers to the production DOI? See, e.g.:

{
"id": 2966,
"datasetId": 175757,
"datasetPersistentId": "doi:10.21337/MXCA5S",
"storageIdentifier": "S3://10.18710/MXCA5S",
"versionNumber": 1,
"versionMinorNumber": 0,
"versionState": "RELEASED",

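The snippet above pairs a test DOI prefix (10.21337) with a storage identifier that still carries the production prefix (10.18710). The thread does not establish whether this mismatch is what blocks the upload, so the following is only a diagnostic sketch for finding affected files. The function names are hypothetical and the only fields used are the two shown in the snippet.

```python
def doi_authority(identifier: str) -> str:
    """Return the prefix part of an ID such as 'doi:10.21337/MXCA5S'."""
    # Drop the scheme ('doi:' or 'S3://') and keep what precedes the first '/'.
    rest = identifier.split(":", 1)[1].lstrip("/")
    return rest.split("/", 1)[0]


def find_prefix_mismatch(version: dict):
    """Return (pid_prefix, storage_prefix) if they differ, else None.

    `version` is assumed to be the object that holds the two fields
    shown in the thread; adjust the lookup if they sit deeper in your JSON.
    """
    pid_prefix = doi_authority(version["datasetPersistentId"])
    storage_prefix = doi_authority(version["storageIdentifier"])
    if pid_prefix == storage_prefix:
        return None
    return pid_prefix, storage_prefix
```

Running `find_prefix_mismatch` over every downloaded file would show how many datasets carry the mixed prefixes before anything is uploaded.
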
view this post on Zulip Philip Durbin 🚀 (Sep 20 2023 at 11:35):

Hmm, what errors do you get from the client side (curl, python, etc.)?

And what errors do you get in server.log?

view this post on Zulip Philipp Conzett (Sep 20 2023 at 11:59):

Thanks! No errors in the command line. I'll need to ask our devops for the server.log. I'm now trying this on production for one dataset without publishing.

view this post on Zulip Philipp Conzett (Sep 20 2023 at 12:02):

I forgot to mention, we only copied the metadata, not the data to test.

view this post on Zulip Philip Durbin 🚀 (Sep 20 2023 at 12:06):

What if you try downloading the JSON from a dataset in your test environment and try to make a change? Does that work? Just a simple change like an edit to the description or something.

view this post on Zulip Philipp Conzett (Sep 20 2023 at 12:32):

Yes, that's basically what I've been doing, but uploading the modified JSON file does not work :-/
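
For the "download, make a simple change, upload" test, a sketch like the one below makes the server's answer visible even when the command line shows no error. It assumes the Dataverse native API endpoint for updating a draft version's metadata (`PUT /api/datasets/:persistentId/versions/:draft`) and the `X-Dataverse-key` header. The server URL, the token, and the exact shape of the JSON body are assumptions to check against the API guide for your Dataverse version, and the function names are hypothetical.

```python
import json
import urllib.error
import urllib.parse
import urllib.request


def build_update_request(server: str, api_token: str, pid: str,
                         version_json: dict) -> urllib.request.Request:
    """Build (but do not send) a PUT that updates the draft version's metadata."""
    query = urllib.parse.urlencode({"persistentId": pid})
    url = f"{server}/api/datasets/:persistentId/versions/:draft?{query}"
    return urllib.request.Request(
        url,
        data=json.dumps(version_json).encode("utf-8"),
        headers={"X-Dataverse-key": api_token,
                 "Content-Type": "application/json"},
        method="PUT",
    )


def send_and_report(request: urllib.request.Request):
    """Send the request and return (HTTP status, body), also for 4xx/5xx replies."""
    try:
        with urllib.request.urlopen(request) as response:
            return response.status, response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # The error body usually carries the server's explanation.
        return err.code, err.read().decode("utf-8")
```

Printing the tuple returned by `send_and_report` for a single dataset gives a status code and message to compare against server.log once it is available.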

view this post on Zulip Philip Durbin 🚀 (Sep 20 2023 at 12:40):

Maybe we should get you set up with a dev environment on your laptop so you can see server.log. :big_smile:

view this post on Zulip Philip Durbin 🚀 (Sep 20 2023 at 12:41):

@Oliver Bertuch what do you think? Is it time for Docker?

view this post on Zulip Oliver Bertuch (Sep 20 2023 at 12:42):

Probably...?

view this post on Zulip Oliver Bertuch (Sep 20 2023 at 12:43):

Easiest way to set up a clean environment

view this post on Zulip Philipp Conzett (Sep 20 2023 at 13:14):

Yes, but the idea was to test it on ~identical datasets before we run it on prod.

view this post on Zulip Philip Durbin 🚀 (Sep 20 2023 at 13:33):

What if I tried to import your prod JSON into my dev environment? Would that be a good test? I'm running the tip of the develop branch.

view this post on Zulip Philipp Conzett (Sep 20 2023 at 13:38):

Thanks, I might want to do that. Let me just first test a couple of datasets on prod.

view this post on Zulip Philipp Conzett (Sep 20 2023 at 13:47):

In the same clean-up job, I'll be publishing new versions of about 800 datasets. From previous, similar jobs (e.g. uploading many files to a dataset via API), I've learned to put a sleep command after each API publishing command. This means running the script will take about 8-10 hours. The idea is to disable login during that time. Now, to reduce the workload, I'm considering turning off file validation, like this:

Before the script is run:
curl -X PUT -d 'false' http://localhost:8080/api/admin/settings/:FileValidationOnPublishEnabled

After the script is run:
curl -X PUT -d 'true' http://localhost:8080/api/admin/settings/:FileValidationOnPublishEnabled

None of the changes are at file level. Are there any concerns about turning off file validation in this case?
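
The plan above (validation off, publish with a pause after each call, validation back on) can be sketched as follows. About 800 datasets in 8-10 hours works out to roughly 36-45 seconds per dataset. The settings URL is the one from the curl commands. The publish endpoint (`POST /api/datasets/:persistentId/actions/:publish` with a `type` parameter) is the Dataverse native API call as I understand it. The server URL, the token, the `minor` release type, and all function names are assumptions.

```python
import time
import urllib.parse
import urllib.request

SETTING_URL = ("http://localhost:8080/api/admin/settings/"
               ":FileValidationOnPublishEnabled")


def setting_request(enabled: bool) -> urllib.request.Request:
    """PUT 'true' or 'false' to the file-validation setting."""
    body = b"true" if enabled else b"false"
    return urllib.request.Request(SETTING_URL, data=body, method="PUT")


def publish_request(server: str, api_token: str, pid: str,
                    release_type: str = "minor") -> urllib.request.Request:
    """POST that publishes the draft of one dataset."""
    query = urllib.parse.urlencode({"persistentId": pid, "type": release_type})
    url = f"{server}/api/datasets/:persistentId/actions/:publish?{query}"
    return urllib.request.Request(
        url, headers={"X-Dataverse-key": api_token}, method="POST")


def publish_all(server, api_token, pids, pause_seconds,
                send=urllib.request.urlopen, sleep=time.sleep):
    """Disable validation, publish each dataset with a pause, re-enable."""
    send(setting_request(False))
    try:
        for pid in pids:
            send(publish_request(server, api_token, pid))
            sleep(pause_seconds)  # give the server time between publishes
    finally:
        # Runs even if a publish call raises, so validation is not left off.
        send(setting_request(True))
```

The `try`/`finally` is the main design choice here: if the script stops partway through the 8-10 hour run, file validation is still switched back on.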

view this post on Zulip Philip Durbin 🚀 (Sep 20 2023 at 13:49):

I don't think so. And I saw your mailing list post. You're only changing metadata, not files. Should be fine.

view this post on Zulip Philipp Conzett (Sep 20 2023 at 13:53):

Great, thanks, good to get this confirmed. I think turning it off will make the process smoother and faster.


Last updated: Nov 01 2025 at 14:11 UTC