I asked how people are showing the MIT license in Dataverse, and I think I like the version that @Dieuwertje Bloemen shared slightly better than what's currently in pull request #10426.
I can only emphasize we should use proper SPDX identifiers wherever possible.
It says "uri" and that should probably be "spdx:MIT" then.
In case we really want a URL and not a URI, let's go for https://spdx.org/licenses/MIT (note the avoided .html)
It is a controlled vocabulary after all
There was an issue asking for a better model of how we describe licenses...
Are you thinking of this by @Philipp Conzett ? standardize license configuration #9262
Yes, and especially the discussion in #8512
IIRC the license JSON files proposed in the PR are not yet valid to be included in Dataverse, aye?
In addition we should provide a meta license for the REUSE framework
Right, that JSON format is aspirational.
I'm not sure what you mean for a meta license but for #10426 (adding MIT) the scope will be the existing fields we can populate.
Oliver Bertuch said:
In case we really want a URL and not a URI, let's go for https://spdx.org/licenses/MIT (note the avoided .html)
I'm pretty confused by this. How are we supposed to know not to include .html? How do you know this? I'm poking around at https://spdx.org/licenses/ but I don't see any information about this. And when you click "MIT" you land on https://spdx.org/licenses/MIT.html which seems to be identical to the version without the .html.
SPDX is more or less a controlled vocabulary. schema.org works very similarly: https://schema.org/Person is an example
So using https://spdx.org/license/MIT is not a URI (as it is a URL) but comes rather close to a term URI
All URLs are URIs but that's beside the point. :sweat_smile:
But how do you know you can lop off the .html? You simply tried it and it worked?
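The two forms of the SPDX link discussed above can be compared mechanically. A minimal sketch, assuming the only difference between them is a trailing ".html" on the path; the function names are hypothetical and nothing here is an SPDX-provided API:

```python
from urllib.parse import urlsplit, urlunsplit


def strip_html_suffix(url: str) -> str:
    """Return the URL with a trailing '.html' removed from its path."""
    parts = urlsplit(url)
    path = parts.path
    if path.endswith(".html"):
        path = path[: -len(".html")]
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


def same_spdx_term(a: str, b: str) -> bool:
    """Treat two links as the same term if they only differ by '.html'."""
    return strip_html_suffix(a) == strip_html_suffix(b)


print(strip_html_suffix("https://spdx.org/licenses/MIT.html"))
```

This only normalizes strings; it does not check that either URL actually resolves.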
FYI: I just made the JSON to be as close to the existing creative commons ones. So, if you look at the spdx.org/license pages, you'll see that you find the uri's I filled in on those details pages. For CC-BY-4.0 at https://spdx.org/licenses/CC-BY-4.0.html, you find the creative commons url as desired by e.g. OpenAire. So, that's why I grabbed "https://opensource.org/license/MIT" for the MIT license, as that's the link mentioned on the MIT spdx details page: https://spdx.org/license/MIT
Yeah, we definitely have the problem that we are mixing representation and visualization... We want a persistent identifier (which SPDX provides), but also want a human-readable thing. Mixing both in the same file is prone to errors.
Should they post a big note at the top of https://spdx.org/licenses/ that says:
"WARNING: Please remove .html from all the links for these licenses!"
I guess I've been looking at the "name" element as the persistent identifier. As we filled in the SPDX ID in there for our licenses... But I'm sure that's not the only way to do it. It's kinda the issue in general with licenses, there doesn't seem to be a true controlled vocabulary for it anywhere, the SPDX list comes closest and everyone uses it, but still there are a bunch of different url's in circulation for the same licenses, even for the creative commons ones (with or without "/legalcode" attached to the end for example). Luckily most systems can deal with both (e.g. https://guidelines.openaire.eu/en/latest/data/field_rights.html)
It's scary that there are multiple URLs in use for a license listed on SPDX. :scream: Which one should we use?
I don't think there is a clear-cut answer to that question, really :melting_face:
Naively, I would go to the list at https://spdx.org/licenses/ and use the URL they present for a license whether it ends in .html or not.
I wouldn't hack the URL.
I would look to the details page of a license on SPDX to get the url there, as for the creative commons, that's where you find the one OpenAire expects base URL-wise. But again, I don't think either is wrong...
@Dieuwertje Bloemen right, you went for https://opensource.org/license/mit/ rather than https://spdx.org/licenses/MIT.html or https://spdx.org/licenses/MIT
I need a three sided coin to flip.
Philip Durbin said:
I need a three sided coin to flip.
Just use a d&d dice :rolling_on_the_floor_laughing:
Maybe we should add a couple more popular software licenses to make sure we're following the right pattern. One that has a dash and a number in the identifier like Apache-2.0.
We have a bunch of software licenses in our set up. I've quickly grabbed the lot of them from our github, as the github isn't public. The attached zip contains all the software license JSONs we have in our set-up as a possible starting point/example.
Software-License-JSONs-KU-Leuven.zip
Thanks, @Dieuwertje Bloemen !
That is amazing! Thanks @Dieuwertje Bloemen
In License.java name and uri have to be unique:
@Column(columnDefinition="TEXT", nullable = false, unique = true)
private String name;
@Column(columnDefinition="TEXT", nullable = false, unique = true)
private String uri;
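Because both columns are unique, a set of license JSON files can be checked for clashes before loading them. A rough sketch, assuming each license is a dict with "name" and "uri" keys as in the JSON files discussed in this thread; find_duplicates is a hypothetical helper, not part of Dataverse:

```python
from collections import Counter


def find_duplicates(licenses, field):
    """Return the values of `field` that occur in more than one license."""
    counts = Counter(lic[field] for lic in licenses)
    return sorted(value for value, n in counts.items() if n > 1)


# Two entries that would violate the unique constraint on uri.
candidates = [
    {"name": "MIT", "uri": "https://spdx.org/licenses/MIT"},
    {"name": "MIT License", "uri": "https://spdx.org/licenses/MIT"},
    {"name": "Apache-2.0", "uri": "https://spdx.org/licenses/Apache-2.0"},
]

print(find_duplicates(candidates, "name"))
print(find_duplicates(candidates, "uri"))
```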
I am looking back at this again, I thought that #9262 was a PR that would make a more standardized way to add the licenses, and #10426 was just to provide the license for #248. Is #10426 not even necessary or should we still try to aim to add it for now with what kindly @Dieuwertje Bloemen provided?
I guess a PR based on #9262 will resolve the issue, but it will need some more work than #10426, which, I presume, is based on the current way to specify licenses.
Just learned at the Distribits conference: it might be good to support https://www.w3.org/ns/odrl/2/ in Dataverse for our custom licenses feature
Oliver and I are talking this over in person
Ok, @Juan Pablo Tosca Villanueva and I just wrote up some new guidance. Please take a look: https://github.com/IQSS/dataverse/pull/10426#issuecomment-2050108042
@Philipp Conzett can you please take a look? It definitely relates to this issue you opened: Feature Request/Idea: Standardize standard license configuration #8512
@Dieuwertje Bloemen I'm also interested in what you think.
Please note that we're proposing a slightly different URL for the MIT license:
"uri": "https://opensource.org/license/mit",
Whereas you have this (all caps MIT):
"uri": "https://opensource.org/licenses/MIT",
Thanks, @Philip Durbin , I've left a comment in the PR.
Thanks, @Philipp Conzett ! I'll circle back to your comment soon.
@Juan Pablo Tosca Villanueva that was a very good point from @Julian Gautier during standup, that we don't have control over what non-Dataverse systems are doing when it comes to licenses, especially in the context of harvesting.
A good test would be to see if licenses are harvested or not. We should be able to harvest from https://demo.dataverse.org , which has already been upgraded to 6.2.
I just ran a script that produces a CSV file showing what 63 Dataverse installations are using for licenses:
licenses_used_by_dataverse_installations_2024.04.16_11.23.31.csv
I might've missed a few installations that are using Dataverse versions where setting multiple licenses is possible, but this should be most of them.
I added a column, "license_name(no_dashes)", where all dashes were removed from the "license name" since dashes versus no dashes was an early concern.
Hope this is helpful!
Ah, so we could harvest from some of them as well. Thanks, @Julian Gautier!
Ok, I just replied to the comments on #10426 by @Philipp Conzett and @Dieuwertje Bloemen
@Juan Pablo Tosca Villanueva I just assigned #10426 to you. I like @Dieuwertje Bloemen's idea of using a license other than MIT (which is so simple) in the example in the guides.
Thanks! I will give it a look ASAP :smile:
@Philip Durbin
But it wouldn't match what's in the JSON, right?
"name" would not match https://guides.dataverse.org/en/6.2/_downloads/62ab2ded1364d7e074e284b1f1450dcc/licenseCC-BY-NC-4.0.json right?
Yeah would be a different version, then I should just pick any that is not currently included?
Apache-2.0 maybe?
:salute:
Will do that one
At least there's a dash in it. :grinning:
:rolling_on_the_floor_laughing:
uh oh
it has two "other" websites according to https://spdx.org/licenses/Apache-2.0.html
https://www.apache.org/licenses/LICENSE-2.0 and https://opensource.org/licenses/Apache-2.0
Time to adjust our guidance! :sweat_smile:
tell people to flip a coin?
pick the first one?
May the force be with them
The only way I could think how to standardize this and make it safe is hosting them on something like 'https://www.gdcc.io/licenses/Apache-2.0' :thinking:
And probably we could have a format for submission and people could request a license to be added there
:shrug:
You know, now that I am looking at this, I am not sure that what we did, even for our MIT license example, is the right thing to do. If we look at the header, it clearly states "Other web pages for this license", and the content of the license is on the landing page "https://spdx.org/licenses/MIT.html", which also has the URI "https://spdx.org/licenses/MIT".
Same for Apache, "https://spdx.org/licenses/Apache-2.0"
So our Apache could look like this
{
  "name": "Apache-2.0",
  "uri": "https://spdx.org/licenses/Apache-2.0",
  "shortDescription": "Apache License 2.0",
  "active": true,
  "sortOrder": 9
}
And the MIT like this:
{
  "name": "MIT",
  "uri": "https://spdx.org/licenses/MIT",
  "shortDescription": "MIT License",
  "active": true,
  "sortOrder": 8
}
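If the SPDX URI pattern above were adopted, entries like these could be generated from the SPDX identifier alone. A minimal sketch, assuming the five fields shown in the two examples; spdx_license_entry is a hypothetical helper and the sortOrder values are simply the ones used above:

```python
import json


def spdx_license_entry(spdx_id, short_description, sort_order, active=True):
    """Build a license dict whose uri is the SPDX URL without '.html'."""
    return {
        "name": spdx_id,
        "uri": f"https://spdx.org/licenses/{spdx_id}",
        "shortDescription": short_description,
        "active": active,
        "sortOrder": sort_order,
    }


apache = spdx_license_entry("Apache-2.0", "Apache License 2.0", 9)
mit = spdx_license_entry("MIT", "MIT License", 8)
print(json.dumps(apache, indent=2))
```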
The fact that it states "other web pages" makes me think that there could be 0 to N listed pages, and that they are just considered external resources. The landing page is the one considered the source of the license, and from what I have seen so far its URI is the same as the URL minus .html.
To me, it makes sense to pick the first of the "other web pages" links, as in my experience that is the "proprietary" one, if available (though I'm not 100% sure that one being the first is always the case, as I haven't checked all of them). There are perhaps also cases where there are no "other web pages", so then the spdx url makes sense. We chose to go with the "proprietary" one wherever available, otherwise the opensource.org one or the spdx one as a final option. This is because the "proprietary" one is the most interoperable, I believe. But again this goes into the issue that there doesn't seem to be any clear guidance on the standards/standardization of licenses in metadata and their urls. I don't think there is a simple and "correct" option. Perhaps we should have a look around at what other repository systems choose as license urls (including e.g. GitHub)?
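The fallback order described in this message ("proprietary" link, then opensource.org, then SPDX) could be sketched as follows. This is only an illustration: it assumes that any "other web pages" link not on opensource.org counts as the "proprietary" one, which is a crude stand-in for the human judgment described above, and choose_license_uri is a hypothetical name.

```python
from urllib.parse import urlsplit


def choose_license_uri(spdx_id, other_links):
    """Pick a uri: first non-opensource.org link, else first link, else SPDX.

    other_links is the list of "other web pages" URLs from the SPDX page,
    in the order they are listed there.
    """
    def host(url):
        return (urlsplit(url).hostname or "").lower()

    # Assumption: anything not hosted on opensource.org is "proprietary".
    proprietary = [u for u in other_links if not host(u).endswith("opensource.org")]
    if proprietary:
        return proprietary[0]
    if other_links:
        return other_links[0]
    return f"https://spdx.org/licenses/{spdx_id}"


print(choose_license_uri("Apache-2.0", [
    "https://www.apache.org/licenses/LICENSE-2.0",
    "https://opensource.org/licenses/Apache-2.0",
]))
```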
This is what Zenodo does:
"rights": [
{
"description": {
"en": "The Creative Commons Attribution license allows re-distribution and re-use of a licensed work on the condition that the creator is appropriately credited."
},
"icon": "cc-by-icon",
"id": "cc-by-4.0",
"props": {
"scheme": "spdx",
"url": "https://creativecommons.org/licenses/by/4.0/legalcode"
},
"title": {
"en": "Creative Commons Attribution 4.0 International"
}
}
]
In JSON-LD they use "schema.org:license": "https://creativecommons.org/licenses/by/4.0/legalcode"
Same for CodeMeta
For CFF they use the license ID, which is from SPDX:
license:
- cc-by-4.0
For DataCite they use the deep URL and SPDX:
<rightsList>
<rights rightsURI="https://creativecommons.org/licenses/by/4.0/legalcode" rightsIdentifierScheme="spdx" rightsIdentifier="cc-by-4.0">Creative Commons Attribution 4.0 International</rights>
</rightsList>
DataCite doesn't say much but their examples says use the deep URL and SPDX. https://datacite-metadata-schema.readthedocs.io/en/4.5/properties/rights/#rights
So IMHO the cleanest thing here would be adding an spdx identifier and keep the deep URLs you were proposing @Dieuwertje Bloemen
Then the name and id are separated and no longer conflict
@Juan Pablo Tosca Villanueva is the name attribute used to generate the license select list?
IMHO we should not use the id field of the database model for SPDX, as not all of our license will have an SPDX identifier
Or, if it helps, we just use the ID field but annotate if this is following a scheme
I made a script to check how many "other" URLs each SPDX license has. Some of them have up to 3. Some examples:
For example, Boehm-GC has the following: * https://fedoraproject.org/wiki/Licensing:MIT#Another_Minimal_variant_(found_in_libatomic_ops)
The first one being "https://fedoraproject.org/wiki/Licensing:MIT#Another_Minimal_variant_(found_in_libatomic_ops)"
So I am not sure there is a way to guarantee that the first one is going to follow some specific pattern. But as I mentioned, each one of these has a URI on SPDX, and it includes the content of the license.
There is 1 with 8 so far: https://spdx.org/licenses/D-FSL-1.0.html
4 "other links" https://spdx.org/licenses/Cronyx.html
Any thoughts on this @Philip Durbin :smile:
So in total SPDX has 606 licenses available
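A count like the one described above could be reproduced along these lines. The script used in this thread is not shown, so this is only a sketch; it assumes SPDX's machine-readable license list is a JSON object with a "licenses" array whose entries carry "licenseId" and "seeAlso" (the "other web pages" links). The sample data is a tiny hand-written stand-in using URLs mentioned earlier in the thread.

```python
def other_link_counts(license_list):
    """Map each licenseId to the number of 'seeAlso' links it lists."""
    return {
        lic["licenseId"]: len(lic.get("seeAlso", []))
        for lic in license_list["licenses"]
    }


def most_links(license_list, top=3):
    """Return the `top` licenses with the most links, highest count first."""
    counts = other_link_counts(license_list)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


# Hand-written sample in the assumed shape, not the real SPDX list.
sample = {
    "licenses": [
        {"licenseId": "Apache-2.0", "seeAlso": [
            "https://www.apache.org/licenses/LICENSE-2.0",
            "https://opensource.org/licenses/Apache-2.0",
        ]},
        {"licenseId": "MIT", "seeAlso": ["https://opensource.org/license/mit/"]},
    ]
}

print(most_links(sample, top=1))
```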
Probably a coin won't work but a D&D dice would do the trick nicely
Another thing I like about using the SPDX URI is the consistency: across all 606 licenses, the URI is the same as the URL minus .html.
Well, I'm starting to doubt this line that we added in the PR:
- For the ``uri`` field, go to the SPDX landing page for the license and click on the link under "other web pages for this license". Let any redirection happen and then copy the URL (e.g. ``https://opensource.org/license/mit``) into the ``uri`` field.
Especially if there are 8 links!
The good news is that those 8 "other" links can be covered by a single URI (https://spdx.org/licenses/D-FSL-1.0) that also includes the content and all the info :smiley:
It seems for example in that one with the 8 links they also keep the references to dead links, another thing why I want to use the SPDX URI
https://www.hbz-nrw.de/produkte/open-access/lizenzen/dfsl/D-FSL-1_0_de.txt/at_download/file [no longer live]
https://www.hbz-nrw.de/produkte/open-access/lizenzen/dfsl/D-FSL-1_0_en.txt/at_download/file [no longer live]
But still grandfather the CC licenses?
I would still grandfather them until we figure out the harvesting results on the license facet
Do you want to go ahead and try harvesting from one of the places Julian mentioned above?
:+1:
I'm interested in knowing what's in the harvested record. I expect it to have the name of the license. But will it have the uri as well? :thinking:
I guess I would suggest our native JSON format to get as much data as possible.
I'm just really worried that harvesters won't interpret the spdx urls correctly. Because that's as far as I know not what they expect to get. For OpenAire, I know that the minimum they expect is a url, this is for example what we do at KU Leuven for the DSpace based open access repository. We don't supply anything other than the license uri and they interpret it correctly to how they represent it in the UI. E.g. they only harvest the url via our DSpace for a record:
<dc:rights>https://creativecommons.org/licenses/by-nc-nd/4.0/</dc:rights>
And in the OpenAire UI for 'Lirias' (name of our Dspace) this is shown as:
image.png
Dieuwertje Bloemen said:
I'm just really worried that harvesters won't interpret the spdx urls correctly. Because that's as far as I know not what they expect to get. For OpenAire, I know that the minimum they expect is a url
I am a bit confused here because I think the original intent, no matter what or which URI we used, was to correct this and stop using a URL (the parameter in the JSON is URI). Even if we used the ones from "other links", the proposal was the use of the URI.
I am going to do some testing on this today, and also test the new license facet added in 6.2 to see what happens when a license is translated/localized
Oh so I see the licenses that are translated will appear on Custom Terms
image.png
Even after adding the license to the server, it seems that all the harvested will appear under Custom Terms after reindexing :thinking:
image.png
Following up on what @Dieuwertje Bloemen said: the concern is probably less about using a URL vs. a URI and more that harvesters may be doing some parsing. But that would happen with any license that is not hosted on creativecommons.org, regardless of whether we use a URL or a URI. Only CC licenses are hosted there, so I don't think it would be safe to assume that it will always be a creativecommons.org url. Still, this is another good reason to leave the legacy licenses as they are for now.
So, I looked up examples for the MIT license and apache license in OpenAire and DataCite to make my point a bit clearer (which was; just like for the Creative Commons ones, the SPDX URI/URL isn't used, but one of the 'other links') and found the following two Zenodo records with MIT and Apache licenses:
MIT) https://api.openaire.eu/search/software?doi=10.5281%2Fzenodo.30070 & https://api.datacite.org/dois/10.5281%2Fzenodo.30070
Apache) https://api.openaire.eu/search/software?doi=10.5281%2Fzenodo.10266 & https://api.datacite.org/dois/10.5281%2Fzenodo.10266
For both, the URI provided is not the spdx one, but the 'proprietary' one if available. SPDX base-uri is given only as the "rightsIdentifierScheme" in DataCite, which harks back to the PR by @Philipp Conzett to expand the JSON, which is relevant if, as @Oliver Bertuch mentioned, there will be non-SPDX licenses used by some. I think that's an option, or otherwise use the 'proprietary' URI where available as previously agreed on, so not the SPDX one.
I probably didn't make myself clear enough above: I agree with @Dieuwertje Bloemen, let's stick with using the "deep" URLs. One of the next things to do (probably out of scope for this PR) is extending the data model to carry an identifier. Which may be SPDX but could also identify other, proprietary licenses, as well as a schema identifier (so SPDX can be identified as such as well as some custom scope or other vocabulary provider). Even if folks are using some different title/description in remote instances, using these two identifiers will allow correct mapping from harvesting/imports.
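The extended data model proposed in this message might look roughly like this. It is a sketch only: LicenseRecord, its field names, and match_license are hypothetical, not the Dataverse License entity. Identifiers are compared case-insensitively because the Zenodo and DataCite examples earlier in the thread use "cc-by-4.0" in lower case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LicenseRecord:
    name: str                                 # display name, may be localized
    uri: str                                  # the "deep" URL
    identifier: Optional[str] = None          # e.g. an SPDX identifier
    identifier_scheme: Optional[str] = None   # e.g. "SPDX" or a custom scheme


def match_license(incoming, local_licenses):
    """Match by (scheme, identifier) first, then fall back to the uri."""
    if incoming.identifier and incoming.identifier_scheme:
        key = (incoming.identifier_scheme.lower(), incoming.identifier.lower())
        for lic in local_licenses:
            if lic.identifier and lic.identifier_scheme:
                if (lic.identifier_scheme.lower(), lic.identifier.lower()) == key:
                    return lic
    for lic in local_licenses:
        if lic.uri == incoming.uri:
            return lic
    return None


# Illustrative records: same license, different title and URL.
local = [LicenseRecord("CC BY 4.0", "https://creativecommons.org/licenses/by/4.0/",
                       "CC-BY-4.0", "SPDX")]
harvested = LicenseRecord("Creative Commons Attribution 4.0 International",
                          "https://creativecommons.org/licenses/by/4.0/legalcode",
                          "cc-by-4.0", "spdx")
print(match_license(harvested, local))
```

With the two identifiers in place, the harvested record maps to the local license even though neither the name nor the URL matches.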
Hi @Dieuwertje Bloemen, we are not changing the existing licenses in this PR; the idea until now has been to keep the existing licenses as they are. Right now we are trying to figure out a couple of things:
Jim expressed some concerns on slack about using the SPDX landing page since there wasn't a consensus and this was something different from what DataCite does.
Hi @Juan Pablo Tosca Villanueva , I fully understand the scope, I'm just saying that it's better to make the new license guidance be in line with how it has been done so far for the Creative Commons licenses, where the 'deep links' (so what's under "other links" on SPDX) have been used (as it should be according to all external guidance). And I think that's also what should be done for future licenses that are added (to use the "deep links"), as that's what's normally done by other systems (see my previous comment). It doesn't make sense to me to stray from this community standard. There is no true 'official' standard, but there is a common practice to use the 'proprietary' uri of a license where possible, which can typically be found as the first link under "other links" in the SPDX page, it seems.
I don't know if the communication is going awry here somewhere, because I felt like there was a consensus on this about a week ago to use the deep link (see April 18th). It was just not as simple as the SPDX link, because in SPDX they list multiple options under "other links", but I don't think that's a good enough reason to throw the community standard approach out and do something no one else seems to do.
There is human work necessary to make the JSON for a license anyway, so I think just making sure the guidance is clear on what link to choose from the list if there are multiple should solve the issue.
in other words: I agree with Jim and Oliver :sweat_smile:
Regarding the Creative Commons licenses and the 'proprietary' link it works but at least the MIT license is not hosted or shared by the MIT in any official way, the URL provided on SPDX is from https://opensource.org.
I can understand that there has been an approach using the "first link" for a long time but this could be a good time to point people to use this in a standardized way and the "other links" have a lot of inconsistency, personally for me having a precedent is not enough reason to keep doing it the same way.
I think the best we can do at this point is open a thread on Google Groups where it can be viewed and probably listen to the opinion of the community, I will post the link here and I would appreciate it if people can express their opinion in there so probably we can reach some agreement :smile:
@Juan Pablo Tosca Villanueva I just played around with querying DataCite for SPDX identifiers, like Jim just suggested at standup. I wrote about it here: https://github.com/IQSS/dataverse/pull/10426/files#r1576467266
"I'm not sure if I'm doing this right but if you type rightsList.rightsIdentifierScheme:SPDX AND rightsList.rightsIdentifier:Apache-2.0 at https://commons.datacite.org you'll find a number of works that are licensed under Apache 2. This is what I referenced for what to type as a query: https://support.datacite.org/docs/datacite-commons-search . And here's the same as a URL: https://commons.datacite.org/?query=rightsList.rightsIdentifierScheme%3ASPDX+AND+rightsList.rightsIdentifier%3AApache-2.0"
It's an identifier, it works! Success! :heart: :star:
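The DataCite Commons link quoted above can be rebuilt for any SPDX identifier with the standard library. A small sketch, assuming the query syntax stays as quoted; datacite_commons_query_url is a hypothetical helper:

```python
from urllib.parse import urlencode


def datacite_commons_query_url(spdx_id):
    """Build the DataCite Commons search URL for works with this SPDX id."""
    query = (
        "rightsList.rightsIdentifierScheme:SPDX"
        f" AND rightsList.rightsIdentifier:{spdx_id}"
    )
    # urlencode escapes ':' as %3A and spaces as '+', as in the quoted link.
    return "https://commons.datacite.org/?" + urlencode({"query": query})


print(datacite_commons_query_url("Apache-2.0"))
```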
but
When you click on various results, you seem to get different licenses URLs back for Apache 2.
Maybe I'm doing it wrong.
Instead of looking through results of works it would be nice if the DataCite API would tell us "for SPDX:Apache-2.0 our preferred URL is X"
Let me check this :smile:
thank you!
Might be nice to hack a quick script that queries their API, getting all of the results and get a list of the URLs used
But how many results would you check to get that list of URLs?
All of them?
It's an API and a script does the heavy lifting for us :smile_cat:
Almost 13k results at https://commons.datacite.org/?query=rightsList.rightsIdentifierScheme%3ASPDX+AND+rightsList.rightsIdentifier%3AApache-2.0
I think these are just results and they are related to the licenses but these are not the licenses themselves :thinking:
correct
I don't think we can bring them all, at the end of the day we have to pick one link as a reference since the JSON takes only one parameter and not an array :thinking:
That's not what I meant
I meant grab all these datasets and take a look at the URL they are using for the license
Count them and go with the most commonly used
Obviously not taking a look by yourself, use a script to scrape the data and do the counting for us...
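The counting step suggested here could look something like this once the records have been fetched. A sketch under assumptions: it expects each record to be shaped like a DataCite REST API item, with attributes.rightsList entries carrying rightsUri, rightsIdentifier and rightsIdentifierScheme, and it leaves paging through the API out entirely. The sample records are hand-written, not real results.

```python
from collections import Counter


def license_url_counts(records, spdx_id):
    """Count the rightsUri values used for one SPDX identifier."""
    counts = Counter()
    for record in records:
        for rights in record.get("attributes", {}).get("rightsList", []):
            scheme = (rights.get("rightsIdentifierScheme") or "").lower()
            identifier = (rights.get("rightsIdentifier") or "").lower()
            uri = rights.get("rightsUri")
            if scheme == "spdx" and identifier == spdx_id.lower() and uri:
                counts[uri] += 1
    return counts


def apache_rights(uri):
    """Hand-written sample record in the assumed shape."""
    return {"attributes": {"rightsList": [{
        "rightsUri": uri,
        "rightsIdentifier": "apache-2.0",
        "rightsIdentifierScheme": "SPDX",
    }]}}


sample_records = [
    apache_rights("https://www.apache.org/licenses/LICENSE-2.0"),
    apache_rights("https://www.apache.org/licenses/LICENSE-2.0"),
    apache_rights("https://opensource.org/licenses/Apache-2.0"),
]

print(license_url_counts(sample_records, "Apache-2.0").most_common(1))
```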
I guess I was hoping that DataCite is being opinionated, that they picked and blessed one URL for a given SPDX license. Unfortunately, not.
How do you see the link for the license in there? :sweat_smile:
You have to click "export" and pick a format.
As an example on this one https://api.datacite.org/application/vnd.schemaorg.ld+json/10.5284/1000181 the only thing I see on the license is this "http://archaeologydataservice.ac.uk/advice/termsOfUseAndAccess" and I don't see this one even indexed on the filters :thinking:
Prob that is like our custom terms kind of thing?
One thing I have also been thinking about is standardizing on the SPDX URL, for example because of what @Dieuwertje Bloemen mentioned about the interpretation of the provided content. As of now, if someone is interpreting or parsing the content of the Creative Commons licenses, that wouldn't work with other pages like "https://opensource.org/license/mit". But if we get all the licenses from the same source, this could be done not just for CC but for all the other licenses that are added.
A question just came in: https://github.com/IQSS/dataverse/pull/9262#issuecomment-2126877068
I just hope "error parsing" is not something we are giving as an error :upside_down:
I wouldn't be surprised :sweat_smile:
That statement "error parsing" and "unsupported license" should not exist in the same sentence :rolling_on_the_floor_laughing:
Am I right?
Like if there is a parsing error, well there is something wrong with the json
So we shouldn't be able to validate the license
Yeah, you're right. It's confusing. :thinking:
I see that Jim mentioned the API to list the licenses, I was thinking "do we really not have an API for that?"
Regarding his last point, I don't think it is a bug since you don't create the license. Probably would have been better if the parameter was the license ID but that would be a breaking change :upside_down:
https://demo.dataverse.org/api/licenses wasn't in my browser history. I should have mentioned it. I'm glad Jim did.
We might need a new topic for his last point. :grinning:
I will add a comment today/tomorrow but I just want to check what the api does before I say something inaccurate :rolling_on_the_floor_laughing:
So yeah if ANY exception happens during the parsing we catch all the exceptions and return a "parse" error...
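For contrast, here is a sketch of how the two failure cases could be reported separately rather than through one catch-all. It is not the Dataverse code; classify_license_input and its status strings are hypothetical, and it assumes the input is a JSON object with a "uri" key.

```python
import json


def classify_license_input(raw_json, known_uris):
    """Return 'parse_error', 'invalid_structure', 'unsupported_license' or 'ok'."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        # Only malformed JSON ends up here.
        return "parse_error"
    if not isinstance(data, dict) or "uri" not in data:
        return "invalid_structure"
    if data["uri"] not in known_uris:
        # Valid JSON, but the license is not configured on this installation.
        return "unsupported_license"
    return "ok"


known = {"https://spdx.org/licenses/MIT"}
print(classify_license_input('{"uri": "https://spdx.org/licenses/Apache-2.0"}', known))
```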
Also I am glad I didn't say anything
according to this
If uri is provided, we'll try that first. This is an easier lookup
method; the uri is always the same. The name may have been customized
(translated) on this instance, so we may be dealing with such translated
name, if this is exported json that we are processing. Meaning, unlike
the uri, we cannot simply check it against the name in the License
database table.
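The lookup order described in that comment can be sketched as below. This is an illustration, not the actual implementation: find_license is a hypothetical function, the translations mapping (localized name to stored name) is an invented stand-in for however an installation resolves translated names, and falling back to the name when a given uri is not found is an assumption.

```python
def find_license(licenses, uri=None, name=None, translations=None):
    """Look up a license by uri first, then by (possibly translated) name."""
    translations = translations or {}
    if uri:
        # The uri is the same on every installation, so compare it directly.
        for lic in licenses:
            if lic["uri"] == uri:
                return lic
    if name:
        # The name may be localized; map it back before comparing.
        stored_name = translations.get(name, name)
        for lic in licenses:
            if lic["name"] == stored_name:
                return lic
    return None


installed = [{"name": "MIT", "uri": "https://opensource.org/license/mit"}]
# Hypothetical translated display name for this installation.
translated = {"Licence MIT": "MIT"}

print(find_license(installed, uri="https://opensource.org/license/mit"))
print(find_license(installed, name="Licence MIT", translations=translated))
```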
Yeah.
I was so tired that I replied that I wrote that :rolling_on_the_floor_laughing:
At least he got the message, I wonder if we should encourage him to update to 6.2 not only 5.1 :thinking:
This is Oliver's installation. He plans to upgrade them to the latest.
@Dieuwertje Bloemen @Philipp Conzett would you like to review this PR?
Add EUPL-1.2, ODbL-1.0, ODC-By-1.0, PDDL-1.0, and OGL-UK-3.0 to the list of standard licenses, clarify add license docs #11522
Last updated: Nov 01 2025 at 14:11 UTC