Stream: dev

Topic: Operating on non data files in a storage independent manner


Balázs Pataki (Jan 19 2024 at 16:59):

This is again related to RO-Crate. So far we have worked with local filesystem storage, so we could easily create, edit, and delete ro-crate-metadata.json files in the same directory where the data files of a dataset are stored.

However, we want to support any storage type that is available in a Dataverse installation.

So, my question is: how can we manage non-data files that are stored alongside the data files of a dataset in a storage-independent manner?

For example, if a dataverse is configured to use S3 storage, how can I create an ro-crate-metadata.json file in a dataset of that dataverse? I think I somehow need to get access to a StorageIO subclass matching the configured storage of the dataverse of my target dataset, e.g. S3AccessIO in my example. Given that I have a Dataset object, how can I get an S3AccessIO to manage my ro-crate-metadata.json (create, edit, rename, delete)? Should I maybe use the AuxiliaryFile mechanism? But as far as I understand, an AuxiliaryFile is related to a DataFile and not a Dataset.

Balázs Pataki (Jan 19 2024 at 17:23):

Maybe the same way files like export_OAI_ORE.cached are handled?

https://github.com/IQSS/dataverse/blob/df318f0c54dba0e216f4e06c8cf80b38e3533876/src/main/java/edu/harvard/iq/dataverse/export/ExportService.java#L372-L375

Philip Durbin 🚀 (Jan 22 2024 at 15:10):

Yes, auxiliary files are associated with data files, not datasets.

Philip Durbin 🚀 (Jan 22 2024 at 15:12):

I'm a little confused. How is S3 so different than local? You should be able to create a JSON file for either one... unless, are you saying you aren't creating the JSON file with Dataverse? You're using some other process.

Can you please link to an example of how it all looks with local files? It sounds like you want to replicate that for S3.

Balázs Pataki (Jan 22 2024 at 16:12):

To put it simply: I just want to add a random file next to the datafiles, no matter where the dataset is stored (locally, in S3, Swift, etc.).

I think I was looking for the StorageIO interface and methods like openAuxChannel(), getAuxFileAsInputStream(), etc.
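
To make the request concrete, here is a minimal, self-contained sketch of what "storage independent" means in this context. It is not Dataverse code: `Sidecars`, `Store`, `InMemory`, `LocalDir` and `roundTrip` are hypothetical names invented for illustration, and the sketch does not reproduce the signatures of `openAuxChannel()` or `getAuxFileAsInputStream()`. The only point is that the caller addresses a file by a tag such as ro-crate-metadata.json and never learns which backend holds it; in Dataverse that role would fall to the StorageIO subclasses mentioned above (e.g. S3AccessIO).

```java
import java.io.ByteArrayInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration only; none of these types exist in Dataverse.
final class Sidecars {

    // A dataset-level store for non-data files, addressed by a tag (file name).
    interface Store {
        void save(String tag, InputStream content) throws IOException;
        InputStream open(String tag) throws IOException;
        boolean exists(String tag) throws IOException;
        void delete(String tag) throws IOException;
    }

    // Stand-in for an object store backend (S3, Swift, ...): keeps bytes in a map.
    static final class InMemory implements Store {
        private final Map<String, byte[]> objects = new ConcurrentHashMap<>();

        public void save(String tag, InputStream content) throws IOException {
            objects.put(tag, content.readAllBytes());
        }
        public InputStream open(String tag) throws IOException {
            byte[] bytes = objects.get(tag);
            if (bytes == null) throw new FileNotFoundException(tag);
            return new ByteArrayInputStream(bytes);
        }
        public boolean exists(String tag) { return objects.containsKey(tag); }
        public void delete(String tag) { objects.remove(tag); }
    }

    // Local filesystem backend: tags become files inside the dataset directory.
    static final class LocalDir implements Store {
        private final Path root;

        LocalDir(Path datasetDir) { this.root = datasetDir.toAbsolutePath().normalize(); }

        // Reject tags that would escape the dataset directory (e.g. "../x").
        private Path resolve(String tag) {
            Path p = root.resolve(tag).normalize();
            if (!p.startsWith(root)) throw new IllegalArgumentException("bad tag: " + tag);
            return p;
        }
        public void save(String tag, InputStream content) throws IOException {
            Files.createDirectories(root);
            Files.copy(content, resolve(tag), StandardCopyOption.REPLACE_EXISTING);
        }
        public InputStream open(String tag) throws IOException {
            return Files.newInputStream(resolve(tag));
        }
        public boolean exists(String tag) { return Files.exists(resolve(tag)); }
        public void delete(String tag) throws IOException { Files.deleteIfExists(resolve(tag)); }
    }

    // Caller code: identical for every backend.
    static String roundTrip(Store store, String tag, String text) throws IOException {
        store.save(tag, new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
        try (InputStream in = store.open(tag)) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        String json = "{\"@graph\": []}";
        Store[] stores = { new InMemory(), new LocalDir(Files.createTempDirectory("dataset")) };
        for (Store s : stores) {
            boolean same = roundTrip(s, "ro-crate-metadata.json", json).equals(json);
            System.out.println(s.getClass().getSimpleName() + " round trip ok: " + same);
            s.delete("ro-crate-metadata.json");
        }
    }
}
```

The caller in `roundTrip` is the part that matters: it would stay unchanged if a third backend were added.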

Philip Durbin 🚀 (Jan 22 2024 at 16:22):

Ok, and you don't want this file to be entered into the database? It just sits there with Dataverse not knowing about it?

Balázs Pataki (Jan 22 2024 at 16:23):

Yes. This is not a datafile, but a "random file" (ro-crate-metadata.json), which lives with the dataset. Much like the thumbnails, cache files, etc. that are already handled this way by Dataverse.

Philip Durbin 🚀 (Jan 22 2024 at 16:42):

I'm not sure but I'm asking internally.

Philip Durbin 🚀 (Jan 22 2024 at 17:44):

Going back to auxiliary files, what if you associated your ro-crate json file with one of the data files? Would that be a problem? Maybe there could always be a README.md or something.

Philip Durbin 🚀 (Jan 22 2024 at 19:30):

I'm chatting with @Leo Andreev and Jim a bit.

Philip Durbin 🚀 (Jan 22 2024 at 19:31):

Can you use the standard exporter framework?

Philip Durbin 🚀 (Jan 22 2024 at 19:31):

Otherwise we might need to extend the idea of aux files to datasets.

Philip Durbin 🚀 (Jan 22 2024 at 19:39):

This probably goes without saying, but I assume it's all related to your PR #10086. But now you want it to work with S3.

Balázs Pataki (Jan 23 2024 at 09:04):

Yes, it is all in the context of RO-Crate handling. The RO-Crate metadata belongs to the dataset, not to a datafile, so I think it would be awkward to artificially attach it to an ad hoc datafile like README.md.

I think the "aux" things in StorageIO have nothing to do with the AuxiliaryFile objects and their handling. But it would be great if you could confirm it.

Philip Durbin 🚀 (Jan 23 2024 at 12:08):

Well, the naming is confusing. They can be related. We had a discussion on Slack about this recently. AuxiliaryFile objects do make use of those "aux" methods in StorageIO. However, those "aux" methods have been around a long time and are used for a number of things besides AuxiliaryFile objects, such as thumbnails, exports, and provenance files.

I know it's confusing! :sweat_smile:

Philip Durbin 🚀 (Jan 23 2024 at 13:45):

But what about using the standard exporter framework? Will that work for you?

Balázs Pataki (Jan 23 2024 at 13:50):

What I am trying to achieve here is not actually part of #10086. We don't need it there, because there we only generate the ro-crate-metadata.json for the latest version of the dataset (we actually cache it in the filesystem, but it would work without the cache as well).

However, in our custom Dataverse installation we want to keep the ro-crate-metadata.json for all versions of the dataset, and besides ro-crate-metadata.json we also store ro-crate-preview.html. So, here we need to manage all these "aux" files.
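
One way to picture keeping both files for every dataset version is to derive a distinct tag per version, so that the files of different versions never overwrite each other regardless of where they are stored. The sketch below is only an assumption for illustration: the `_v<major>.<minor>` naming scheme and the `VersionedTags` helper are hypothetical and are not taken from Dataverse or from the installation described here.

```java
// Hypothetical helper: builds a per-version tag for a dataset-level file.
final class VersionedTags {

    // Inserts "_v<major>.<minor>" before the file extension (assumed naming scheme).
    static String tagFor(String baseName, int major, int minor) {
        int dot = baseName.lastIndexOf('.');
        String stem = dot < 0 ? baseName : baseName.substring(0, dot);
        String ext = dot < 0 ? "" : baseName.substring(dot);
        return stem + "_v" + major + "." + minor + ext;
    }

    public static void main(String[] args) {
        // prints "ro-crate-metadata_v1.0.json"
        System.out.println(tagFor("ro-crate-metadata.json", 1, 0));
        // prints "ro-crate-preview_v2.1.html"
        System.out.println(tagFor("ro-crate-preview.html", 2, 1));
    }
}
```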

Philip Durbin 🚀 (Jan 23 2024 at 13:52):

I see. It still feels like the exporter framework is close to what you need. Maybe it could be extended somehow?


Last updated: Nov 01 2025 at 14:11 UTC