Questions about Turbo Curator · community

Deirdre Kirmis (Jul 25 2024 at 21:04):

Hi all .. this is kind of a strange question, but I'm wondering if anyone has any insight into the inner workings of ICPSR's Turbo Curator? We are working on a project to determine how useful ChatGPT is for generating metadata for other applications that we support, similar to what Turbo Curator does. We are not trying to build anything or get dev secrets, just general info on how it works (i.e., does it use the OpenAI API?) and how useful people find it for datasets. For instance, do folks feel that it generates adequate keywords, descriptions, etc.? We would compare this with our own findings from testing ChatGPT on some of our collection documents (not Dataverse). I've asked a few people individually but otherwise honestly do not know who else might know this info (specific to Dataverse).

Philip Durbin 🚀 (Jul 25 2024 at 21:12):

Yes, it uses OpenAI.

"TurboCurator generates recommendations powered by OpenAI’s ChatGPT technology and ICPSR’s suggestion logic." -- https://turbocurator.icpsr.umich.edu/tc/adminabout

Via https://guides.dataverse.org/en/6.3/admin/external-tools.html#inventory-of-external-tools

Oliver Bertuch (Jul 25 2024 at 21:27):

Maybe one of their next steps could be to enable the use of other providers? We're curious to use this, but we'd certainly want to use Blablador with it (e.g. using Llama 3.1 as one of the models offered), not OpenAI.

Oliver Bertuch (Jul 25 2024 at 21:29):

I'm not sure if they are using a custom-built RAG together with ChatGPT, which would probably increase the quality of generated keywords etc.

Deirdre Kirmis (Jul 25 2024 at 22:21):

ah thanks .. I guess I meant, does it use the OpenAI API in a script on the server to make calls to ChatGPT .. or does it use the Python SDK or something else? still trying to figure out where to start there ..
also, just wondering if there are any stats on user perception of how well it generates the title, description, and keywords based on its analysis of the dataset metadata .. does it get it right or miss some? we are trying to compare what ChatGPT does vs what a human would do

Deirdre Kirmis (Jul 25 2024 at 23:08):

and I think "Ask the Data" probably is doing a similar thing? Sending the file to ChatGPT via API and results are sent back to the UI?

Philip Durbin 🚀 (Jul 26 2024 at 13:07):

Yes, if you look at the Ask the Data README at https://github.com/IQSS/askdataverse you'll see that it also uses OpenAI.

Philip Durbin 🚀 (Jul 26 2024 at 13:08):

TurboCurator is so new I doubt anyone has done a proper study of its recommendations. However, Dataverse installations are encouraged to install it and try it as a first step. :grinning:

Deirdre Kirmis (Jul 26 2024 at 15:28):

Oh yea, that is super helpful! Sorry I missed that. I see they use LangChain which is something we were looking into as well. Thanks!

Kelly Doonan-Reed (Jul 26 2024 at 19:59):

Deirdre - I am Kelly Doonan-Reed, the Product Owner at ICPSR who supports TurboCurator. I consulted with my technical team today and am adding their response.

Kelly Doonan-Reed (Jul 26 2024 at 20:01):

TurboCurator is a web-based tool built using Azure’s version of OpenAI’s ChatGPT that generates recommendations for title, summary, and keywords. We are using Azure’s learning model in OpenAI. We built basic prompts and focused on prompt engineering. The team built prompts in OpenAI’s ChatGPT to provide suggestions for Title, Summary & Keywords. Yes, we used Azure’s OpenAI Java libraries to create APIs that display recommendations in the TurboCurator UI. Our prompts are based on ICPSR’s best practices for metadata.

In our testing of TurboCurator, we had to hone ChatGPT so that it was not too restrictive (did not generate any suggestions) or too creative (generated suggestions that did not make sense).

For keyword recommendations, in addition to building prompts, we added both pre-processing and post-processing. The pre-processing uses a full-text search to get candidate keywords from the ICPSR Thesaurus and includes them in the prompt. The post-processing checks the validity of the keywords returned from the ChatGPT API to prevent AI “hallucination” (creative recommendations). If an original keyword entry is included in the ICPSR Thesaurus, we continue to display it even if ChatGPT dropped it.
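
The pre- and post-processing steps described above could look roughly like the following minimal Python sketch. It is an illustration only and not ICPSR's code (the team mentions Java libraries). `THESAURUS`, `candidate_keywords`, and `validate_keywords` are hypothetical names, the sample terms are made up, and a simple whole-word match stands in for the full-text search.

```python
import re

# Hypothetical stand-in for a controlled vocabulary; not actual ICPSR Thesaurus data.
THESAURUS = {"unemployment", "household income", "public health", "voting behavior"}


def candidate_keywords(text, thesaurus):
    """Pre-processing: find thesaurus terms that occur in the text as whole words or phrases.

    The result would be included in the prompt as candidate keywords.
    """
    lowered = text.lower()
    return sorted(
        term for term in thesaurus
        if re.search(r"\b" + re.escape(term) + r"\b", lowered)
    )


def validate_keywords(model_keywords, original_keywords, thesaurus):
    """Post-processing: keep only keywords found in the thesaurus.

    Model suggestions come first; original keywords that are valid thesaurus
    terms are kept even if the model dropped them.
    """
    kept = []
    for keyword in list(model_keywords) + list(original_keywords):
        normalized = keyword.strip().lower()
        if normalized in thesaurus and normalized not in kept:
            kept.append(normalized)
    return kept
```

For example, `validate_keywords(["Unemployment", "rural magic"], ["public health"], THESAURUS)` drops the invented term "rural magic" and keeps "unemployment" and "public health".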

The human user is presented with recommendations. They can edit these recommendations and decide if they want to copy them back to their dataset.

We are looking for feedback from TurboCurator users on the quality of suggestions generated.

Please go ahead and play around and let us know what you think via Provide Feedback. I am a fan of the Dataverse demo space; it is a safe place to experiment with TurboCurator and learn more. In the TurboCurator tool itself, check out the help for each metadata field to see how we generate its recommendations from ChatGPT, e.g., what rules we use to create a title recommendation.

TurboCurator was featured during the January DataverseTV.

Kelly Doonan-Reed (Jul 26 2024 at 20:02):

Other Resources:

Dataverse Administrator About TurboCurator

FAQ

Provide Feedback

Philip Durbin 🚀 (Jul 26 2024 at 20:42):

Welcome, @Kelly Doonan-Reed and thanks for the detailed response!

Deirdre Kirmis (Jul 26 2024 at 20:51):

Hi Kelly .. thank you so much for your response! This is amazing information, and I appreciate it immensely. It is so very helpful to us. We are really just getting started on this journey and it is helpful to know how some other tools are utilizing models and building prompts, talking to ChatGPT, etc.

We have spent some time messing around with building prompts and various ways to "train" the GPT to get more specific answers, etc. We have done some command line scripting to interact with it, but have a long way to go on that. A big challenge that we will face is that we would like the "tool" that we build to actually cycle through files in the repository and analyze them to provide metadata, but that does not seem to be an easy task! We are also wondering about accuracy and variations in those responses and how much human involvement will still be necessary.

I did watch your Dataverse TV video and that was very helpful as well! The demo site has been very useful in testing things out as well and we do have Turbo Curator installed on our QA and Prod sites. Your pre-processing and post-processing steps are very interesting and seem most important .. I will check out the help in the fields and test it more. I will send some feedback on our experience soon.

Again, thank you, thank you!

Kelly Doonan-Reed (Jul 26 2024 at 23:13):

:+1:

Oliver Bertuch (Jul 27 2024 at 15:39):

Heard this episode today with Bruce Hopkins, who also recently released a book on ChatGPT with Java. airhacks.fm podcast with Adam Bien: From J2ME, over Bluetooth and Speech Recognition to AI

Episode website: http://airhacks.fm/

Media file: https://s3.eu-central-1.amazonaws.com/airhacks.fm/airhacksfm_304.mp3

Oliver Bertuch (Jul 27 2024 at 15:40):

Here's the book: https://javachatgptbook.com/

Last updated: Jan 09 2026 at 14:18 UTC