Stream: python

Topic: dvuploader mime types


view this post on Zulip Oliver Bertuch (Apr 06 2024 at 09:34):

Uploading a PNG file via dvuploader resulted in a file of type plain text???

view this post on Zulip Jan Range (Apr 06 2024 at 14:11):

You need to supply the mimeType for the file. I did experiment with leaving mimeType out of the request body, but it did not work. @Philip Durbin once mentioned that there is a trick to trigger MIME type detection at Dataverse, but I don't remember exactly. Happy to fix this!

view this post on Zulip Philip Durbin 🚀 (Apr 06 2024 at 14:15):

Please note that it’s possible to “trick” a Dataverse installation into giving a file a content type (MIME type) of your choosing. For example, you can make a text file be treated like a video file with -F 'file=@README.txt;type=video/mpeg4'. If the Dataverse installation does not properly detect a file type, specifying the content type via API like this is a potential workaround.

view this post on Zulip Philip Durbin 🚀 (Apr 06 2024 at 14:15):

https://guides.dataverse.org/en/6.2/api/native-api.html#add-a-file-to-a-dataset

view this post on Zulip Jan Range (Apr 06 2024 at 14:38):

Works for the native upload now! In the S3 case, it seems not to be possible to leave out the mimeType in the JSON. It will result in a failed registration of each file:

Bad Request: The file content type cannot be determined. <-- Is actually an XML file

I guess that due to the direct upload to S3, no type detection is happening at Dataverse. Is this correct? If so, I would add a step that checks whether each file object has a mime type before the upload starts.

view this post on Zulip Oliver Bertuch (Apr 06 2024 at 14:41):

IIRC when registering these files you need to provide this metadata. There is also no ingest / analysis / unzip happening when using direct upload

view this post on Zulip Jan Range (Apr 06 2024 at 14:43):

Alright, the mime type is then essential for the upload. I will add a check before uploading.
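
A minimal sketch of the pre-upload check described above. The `FileEntry` class and both function names are hypothetical stand-ins for illustration, not dvuploader's actual API; the only assumption taken from the thread is that direct upload needs a MIME type on every file.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FileEntry:
    """Hypothetical stand-in for a file queued for upload (not dvuploader's real class)."""

    filepath: str
    mime_type: Optional[str] = None


def find_missing_mime_types(files: List[FileEntry]) -> List[str]:
    """Return the paths of all entries whose MIME type is missing or blank."""
    return [f.filepath for f in files if not (f.mime_type and f.mime_type.strip())]


def check_before_upload(files: List[FileEntry]) -> None:
    """Raise before any transfer starts if at least one file has no MIME type."""
    missing = find_missing_mime_types(files)
    if missing:
        raise ValueError("Missing MIME type for: " + ", ".join(missing))
```

Failing for the whole batch up front, rather than per file during registration, avoids leaving objects in S3 that can no longer be registered.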

view this post on Zulip Oliver Bertuch (Apr 06 2024 at 14:47):

It would mean more deps, but would it make sense to have a mime detection library do this for us?

view this post on Zulip Jan Range (Apr 06 2024 at 14:53):

There is one built into Python: mimetypes. It covers most cases but has its limits.
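
For reference, a short sketch of what the standard-library `mimetypes` module does and where it stops. The helper name `guess_mime_type` is made up for this example. The module looks only at the file name, never at the file content.

```python
import mimetypes


def guess_mime_type(filename: str):
    """Guess a MIME type from the file name alone; returns None if unknown."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type


print(guess_mime_type("figure.png"))  # image/png
print(guess_mime_type("README"))      # None (no extension to go by)
```

Because only the extension is inspected, a text file renamed to `figure.png` would still be reported as `image/png`, and files without a known extension come back as `None`.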

view this post on Zulip Jan Range (Apr 06 2024 at 15:01):

An option would be magic but it requires libmagic to be installed, which is not a Python library.

view this post on Zulip Jan Range (Apr 06 2024 at 15:02):

I don't know if this is worth it since there are extra steps required to make it work.

view this post on Zulip maría a. matienzo (Mar 27 2025 at 19:03):

@Jan Range what would you think about some sort of interface here? e.g., dvuploader could ship with mimetypes usage, but if you wanted to use another option (e.g. magic, or something that calls an external tool)
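
One possible shape for such an interface, as a rough sketch only. `MimeDetector`, `mimetypes_detector`, and `UploaderSketch` are hypothetical names and do not reflect how dvuploader is actually structured. The idea is that a detector is any callable mapping a path to a MIME type (or `None`), with `mimetypes` as the shipped default.

```python
import mimetypes
from typing import Callable, Optional

# A detector takes a file path and returns a MIME type, or None if it cannot tell.
MimeDetector = Callable[[str], Optional[str]]


def mimetypes_detector(path: str) -> Optional[str]:
    """Default detector: extension-based lookup from the standard library."""
    return mimetypes.guess_type(path)[0]


class UploaderSketch:
    """Hypothetical uploader that accepts a pluggable MIME detector."""

    def __init__(self, detector: MimeDetector = mimetypes_detector):
        self.detector = detector

    def resolve_mime_type(self, path: str, explicit: Optional[str] = None) -> str:
        # A MIME type supplied by the caller always wins over detection.
        if explicit:
            return explicit
        detected = self.detector(path)
        if detected is None:
            raise ValueError(f"Could not determine MIME type for {path!r}")
        return detected
```

A user with libmagic installed could then pass their own callable, for example one wrapping the `magic` package or an external tool, without dvuploader depending on either.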

view this post on Zulip Jan Range (Mar 27 2025 at 20:12):

Sure, that is a great idea. As far as I know, the magic package requires the libmagic binaries.

An alternative would be to port infer from Rust via Python bindings. This way, users do not need to install anything manually, and we can ship the interface without requiring libmagic. Maybe some bindings exist already. Otherwise, they are quite easy to set up, since the crate is quite simple.

view this post on Zulip maría a. matienzo (Mar 27 2025 at 20:16):

yeah, there's a similar approach used in the marcel gem as used by Ruby on Rails - it uses the signatures from Apache Tika without otherwise adding a dependency on Tika itself.

view this post on Zulip Jan Range (Mar 28 2025 at 08:48):

There are no bindings yet, but I have created a simple one that guesses the mime type.

view this post on Zulip Jan Range (Mar 28 2025 at 08:50):

It is exclusive to binary formats and fails at CSV and other text-based ones. I can either include another one, or we can simply combine it with Python mimetypes.
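
A sketch of the second option, combining content-based detection with `mimetypes` as a fallback. The small signature table below is only an illustrative stand-in for the Rust-backed binding mentioned above, whose API is not shown in this thread; `sniff_signature` and `detect_mime_type` are hypothetical names.

```python
import mimetypes
from typing import Optional

# Tiny illustrative signature table (magic bytes -> MIME type).
# A real signature library covers far more formats than these three.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),
]


def sniff_signature(path: str) -> Optional[str]:
    """Return a MIME type if the first bytes match a known binary signature."""
    with open(path, "rb") as fh:
        head = fh.read(16)
    for magic_bytes, mime_type in SIGNATURES:
        if head.startswith(magic_bytes):
            return mime_type
    return None


def detect_mime_type(path: str) -> Optional[str]:
    """Try content-based detection first, then fall back to the file extension."""
    return sniff_signature(path) or mimetypes.guess_type(path)[0]
```

With this ordering, binary files are identified by content even when the extension is wrong, while text-based formats such as CSV, which have no signature, are still resolved through their extension.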


Last updated: Nov 01 2025 at 14:11 UTC