Participant Perspectives | Cristina Huidiu, Wageningen University & Research Library

DOI 10.7269/C1QP4V

Please explain a little about your background and why you’re interested in persistent identifier (PID) metadata and its enrichment.

I have a background in bibliometrics and data product management, having worked in academic libraries as well as within companies building products for the academic community. I hold the strong belief that open data and open infrastructure offer incredible value if key aspects are streamlined, and PIDs are a very important one of those aspects.

What excites you most about the potential for collaborative enrichment of PID metadata? What do you think will be the most challenging aspect to address?

We work within a complex and fragmented environment, with a newly added layer of volatility from the fast advancements in generative AI, a technology that can be of fantastic help but that is already, in some cases, being leveraged without a full understanding of its consequences. Alignment in such an environment can only be achieved by working together and by bringing together divergent perspectives. Specifically for the process of enrichment of PID metadata, we are only scratching the surface of what could be possible, and I am most excited about working towards seizing these new opportunities. In the open metadata space, tracking data lineage and enrichments across multiple actors allows for the creation of effective feedback loops for metadata quality. These enrichments serve as signals about current metadata quality levels, and when visible in open infrastructure, they indicate where action is needed or where work is already underway. Monitoring these signals and directing resources accordingly can result in cost savings, prevent duplicate work, and create more time and opportunities to develop new services. For example, we could think of building complex trust markers for journals rather than defaulting to journal lists, using complex metadata built into retrieval-augmented generation (RAG) set-ups, and training complex reasoning models at a significantly lower cost.

A transparent ecosystem also inspires trust, and it cannot happen without persistent identifiers at its core. Regarding the most challenging aspects, I see two. First, there is the technical aspect: the different infrastructures were built at different points in time, and consuming real-time feedback, capturing field-level provenance, handling exceptions, supporting human-in-the-loop processes, and making all of this transparent might not come easily to everyone involved. Second, there is the need for upskilling in areas related to this work, such as data engineering.

What successful examples of community collaboration in scholarly infrastructure have you witnessed that could inform the proposed COMET model’s development?

Rather than naming specific examples, I would like to point out the incredibly collaborative, passionate, and determined community we have within the academic space. It’s that spirit that has brought us here and will be key to our success.

How could better and more complete PID metadata, derived from the proposed COMET model, help to advance your goals, those of your organization, or your communities?

Looking at complex strategic analyses surrounding research impact, especially as we aim to step into the realm of predictive analytics and its relationship to generative AI, we need trusted data and a level of transparency for enrichments that we don’t currently have. Ultimately, that’s where I see the impact of COMET: a higher return on investment for our efforts in open data and open infrastructures, less time spent on manual curation of basic metadata, and more time spent on building a more complex research graph, with new services being built on top of its trusted, transparent data. I am being particularly careful not to say ‘good data’, as I don’t necessarily believe in that concept as an overarching label. I do think there can be good data for a specific use case, and we need to be transparent about, and aware of, any biases and limitations when we decide to include one source or another.

What benefits do you envision PID metadata enrichments, such as those being aimed for through COMET, will have on the broader research ecosystem?

Probably the biggest benefit will be creating bridges between all the different infrastructures and starting to eliminate inefficiencies in the process. 

An interesting long-term effect that I am curious about is the kind of start-ups that will appear in the future based on COMET’s enrichment data. We already see commercial products built on top of open research output metadata, and if we continuously raise the bar for what’s openly available, it will be interesting to see how commercial services might be leveraged in the future. In addition, making open metadata a viable and easy-to-reach option for libraries across the world would not only enable new services, but also give them the choice of using an open or a commercial solution.

Why do you think organizations interested in PID metadata enrichment should consider contributing resources to fund the first phase of development for the proposed COMET model?

PIDs are at the center of enabling sustainable, trackable, transparent data curation, and they help us avoid building yet another siloed tool targeting a hyper-specific problem. I also see being involved in COMET as an empowering step, especially for universities, where we take control over how data about us and our research output is represented in open infrastructures. With COMET, we also have an opportunity to work towards an equitable, open infrastructure that benefits organizations without the resources to actively participate at this time.
