Summary Report: COMET Product Development Listening Sessions | December 19, 2024
By Adam Buttrick, John Chodacki, Juan Pablo Alperin, Maria Praetzellis and Clare Dean
Product Development Outline by COMET convener, Adam Buttrick
Comments aggregated from COMET participant read-ahead feedback and listening sessions on December 19, 2024
Product Development Outline
The following was provided as a read-ahead to the listening sessions conducted on December 19, 2024. Comments on the document are combined with those from the listening session and summarized in the participant comments section.
Introduction
Initial stakeholder consultation at the FORCE11 conference in Los Angeles and the Paris Conference on Open Research Information, combined with lessons learned from ROR (Research Organization Registry), suggests a basic structure for incorporating community enrichments into DOI metadata. While this is a good starting point, we now need to refine this structure within COMET as a community, developing it from more general ideas to product definitions that can inform both technical discussions and an eventual community call-to-action. What follows is a sketch of this workflow as a starting point for further discussion. Please suggest revisions, add (or answer!) questions, propose new goals, and contribute revised or supplementary diagrams. This work will inform future meeting discussions and asynchronous efforts.
We are inviting wide participation in the discussion around product development. Even if your expertise is not technical, your perspectives will help shape our direction and approach to community inclusion as the product development plan evolves. Any and all feedback is welcome!
Diagram
Flow diagram illustrating the proposed COMET model's metadata enrichment process, from initial data ingestion through validation, processing, evaluation, and final metadata generation.
Overview
The current idea for this service is that it ingests enrichment events from multiple distinct sources, described in a shared format, each of which might contain overlapping or conflicting improvements to DOI metadata. These events first pass through a source-agnostic validation layer that checks for basic quality and completeness, before being grouped by DOI for cross-source deduplication and conflict resolution. The grouping layer ensures that enrichments from different sources are reconciled systematically relative to their respective records. The reconciled enrichment events then pass through an evaluation layer that could automatically approve high-confidence updates or route them for manual review. Finally, approved changes are combined with existing records to generate the enriched DOI records.
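As a purely illustrative sketch, one way such a shared-format enrichment event might be represented is shown below. The field names and structure are assumptions for discussion, not a defined COMET schema.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class EnrichmentEvent:
    """Hypothetical shared format for a single enrichment event."""
    doi: str                          # DOI of the record being enriched
    field_name: str                   # e.g. "funding", "affiliation", "reference"
    value: Any                        # the proposed enrichment payload
    source: str                       # identifier of the contributing source
    method: str                       # e.g. "manual_curation", "pdf_extraction"
    confidence: float | None = None   # optional score for automated methods
    provenance: dict = field(default_factory=dict)  # source system, timestamps, etc.

# Example: a funding correction submitted by an institutional data manager
event = EnrichmentEvent(
    doi="10.1234/example.5678",
    field_name="funding",
    value={"funder": "Example Funder", "award": "ABC-123"},
    source="example-institution",
    method="manual_curation",
    provenance={"submitted": "2024-12-19"},
)
```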
Outline
1. Source Enrichment Events
Multiple, independent sources can contribute metadata enrichments in a shared format.
Enrichments address various aspects of DOI metadata (affiliations, references, funding information, etc.)
Provenance tracking for all sources and enrichments
Enrichment events from multiple sources are provided in a standardized format
2. Ingestion Pipeline
Source-agnostic validation to check for basic quality and completeness
Routes valid entries for processing and tracks errant entries for logging and resolution
3. Processing and Deduplication
Groups enrichment events by their identifier
Identifies duplications and conflicts when multiple sources provide enrichments for the same aspects of a record
4. Evaluation System
Structured evaluation criteria to assess enrichment quality
Potential automated decision-making based on trust indicators and provenance
Routes complex or unresolvable cases for manual review
Transparency in evaluations (automated and manual review)
5. Metadata Generation
Retrieves current state of DOI metadata
Creates grouped update records from the enrichments approved in prior steps
Validates final updates against the source schema
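Read as a pipeline, the five steps above might be chained roughly as follows. This is a minimal sketch for discussion, assuming enrichment events are plain dictionaries; the stage names, signatures, and rules are placeholders for the richer behaviour described in the outline.

```python
from collections import defaultdict

def validate(events):
    """Source-agnostic validation: keep complete events, log errant ones."""
    valid, errant = [], []
    for e in events:
        target = valid if e.get("doi") and e.get("field_name") and e.get("value") else errant
        target.append(e)
    return valid, errant

def group_by_doi(events):
    """Group enrichment events by the DOI they target."""
    groups = defaultdict(list)
    for e in events:
        groups[e["doi"]].append(e)
    return groups

def deduplicate(groups):
    """Drop identical duplicate enrichments within each DOI group."""
    deduped = {}
    for doi, events in groups.items():
        seen, unique = set(), []
        for e in events:
            key = (e["field_name"], repr(e["value"]))
            if key not in seen:
                seen.add(key)
                unique.append(e)
        deduped[doi] = unique
    return deduped

def evaluate(groups, threshold=0.9):
    """Auto-approve high-confidence events; route the rest for manual review."""
    approved, review = [], []
    for events in groups.values():
        for e in events:
            (approved if e.get("confidence", 0) >= threshold else review).append(e)
    return approved, review

def generate_updates(approved, current_records):
    """Combine approved enrichments with the existing record for each DOI."""
    grouped = defaultdict(list)
    for e in approved:
        grouped[e["doi"]].append({e["field_name"]: e["value"]})
    return {doi: {"existing": current_records.get(doi, {}), "enrichments": enr}
            for doi, enr in grouped.items()}
```

A real implementation would also need the provenance checks, feedback loops, and final schema validation raised in the open questions below.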
Goals
All enrichment processes are transparent and can be validated by the community
Multiple validation layers
Clear provenance for all enrichments and their sources
Focused on immediate needs and known gaps
Design for use with existing systems and workflows
Broad community participation
Examples
Individual
A dataset's DOI metadata has incomplete funding references. Institutional data managers and grant administrators discover this missing funding metadata during routine compliance tracking and reporting tasks, with each submitting corrections. These submissions might contain overlapping or differently-formatted grant numbers. The service first checks that all submitted funding data corrections are schema-valid representations of the funding references, then deduplicates and combines the independent funding reference updates tied to the single DOI. The proposed COMET model identifies and attempts to reconcile where the same grant identifier appears in multiple formats, either using automated rules for comparison or flagging for expert review. Once the format discrepancy for the grant numbers is resolved, the complete set of funding information is added to the record.
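As a minimal illustration of the reconciliation step this example describes, the sketch below normalizes grant numbers before grouping them, assuming grant identifiers can be compared as simple strings; the normalization rules (case folding, stripping spaces, hyphens, and slashes) are illustrative only.

```python
import re

def normalize_grant_id(grant_id: str) -> str:
    """Reduce a grant identifier to a canonical form for comparison."""
    return re.sub(r"[\s\-–/]", "", grant_id).upper()

def reconcile_funding(corrections):
    """Group differently formatted submissions of the same grant number."""
    by_grant = {}
    for c in corrections:
        key = normalize_grant_id(c["award"])
        by_grant.setdefault(key, []).append(c)
    return by_grant

submissions = [
    {"funder": "Example Funder", "award": "ABC 123-456", "source": "grants-office"},
    {"funder": "Example Funder", "award": "abc123456", "source": "data-manager"},
]
# Both submissions collapse to a single canonical grant identifier, so only
# one funding reference is added to the DOI record.
print(reconcile_funding(submissions))
```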
Aggregate
A publisher's collection of open access journal articles has incomplete DOI metadata across its records. A discovery service has enriched these records by combining data from multiple sources: authors, affiliations, and references extracted from their openly available PDFs, ROR IDs assigned to affiliations using a matching service, and funding information derived from agency databases. These enrichments include detailed provenance tracking for each data point, including source systems, extraction methods, and corresponding confidence scores for the extraction or matching methods. The proposed COMET model first validates that all submitted enrichments meet basic quality and schema requirements, then groups them by DOI. The model then assesses the individual enrichments independently, evaluating them relative to a reliability score associated with the source or process, or by comparing them against the results from alternative enrichment sources (e.g. other ROR ID matching or data extraction services). Cases requiring human judgment - such as conflicting ROR ID assignments for an affiliation - are flagged for expert review. Some enrichments are identified as valid, while others are discarded as errant. Those that are verified proceed through the system as updates to each DOI record, with a detailed audit trail of the contributing sources and of which enrichments were accepted or rejected.
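The routing logic this example implies might look something like the sketch below, which assumes each enrichment carries a numeric confidence score and that a single threshold separates automated approval from rejection; both the threshold and the conflict rule are placeholders for criteria the community would need to define.

```python
def route_affiliation_enrichments(enrichments, threshold=0.9):
    """Split ROR ID assignments into auto-approved, review, and rejected sets."""
    approved, review, rejected = [], [], []
    # Conflicting ROR IDs for the same affiliation string always go to expert review.
    ror_ids_seen = {}
    for e in enrichments:
        ror_ids_seen.setdefault(e["affiliation"], set()).add(e["ror_id"])
    for e in enrichments:
        if len(ror_ids_seen[e["affiliation"]]) > 1:
            review.append(e)       # conflicting assignments: needs human judgment
        elif e["confidence"] >= threshold:
            approved.append(e)     # high-confidence, uncontested match
        else:
            rejected.append(e)     # low-confidence match, discarded as errant
    return approved, review, rejected
```

In practice the rejection rule would likely be more graduated, for example with a second, lower threshold below which enrichments are discarded and between which they are routed for review.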
Open Questions - For Discussion
We invite your questions and comments ahead of the listening sessions. Below are some of the questions we might consider ahead of and during those sessions. They are not intended to be exhaustive or final. Please propose others and add clarifications wherever you think they are needed:
Source Integrations
What is required to standardize descriptions of and assess the quality of enrichments across different sources?
What constitutes an enrichment? (Can that be an SDG classification? And does this require the target metadata to be extended?)
What constitutes sufficient provenance for an enrichment event?
What feedback loops should exist between source performance and trust metrics?
What are the procedures for submitting and integrating a metadata source?
What constitutes a suitable/in scope metadata source?
Ingestion and Validation
What constitutes a complete and valid enrichment event?
How should validation and conflict resolution be handled across enrichments, including universal vs. field-specific rules, failure handling, interdependent fields, and criteria for meaningful differences when merging?
What feedback mechanisms should exist between the system and enrichment sources?
Processing
How do we establish identity and equivalence across enrichment events and sources?
How should partial or overlapping enrichments be handled?
How do we handle conflicting information from sources of equal authority?
Are all sources treated equally for all properties of a record? How is differing authority defined/decided?
Evaluation
What factors determine automated approval vs. those that require human review?
What role should community consensus play in evaluation criteria?
How should review decisions be documented and shared?
Management
What versioning and history should be maintained?
How long should enrichment history be retained?
What mechanisms should exist for rollback and correction?
How do we handle deletion or deprecation of previously enriched information?
What mechanisms should be in place to handle appeals (e.g. if an individual or organization has a concern about metadata enriched for an object associated with them)?
Distribution
In what additional formats should enrichments be provided?
How should enriched records be made available to users and their sources?
How do we handle attribution of enriched content?
Impact
How do we monitor and evaluate the overall impact of the enrichment system?
What metrics should be tracked for individual enrichment sources?
How do we measure the value added by different types or categories of enrichments?
How equitably are the benefits of the system realized? Do certain sectors and/or geographic locations experience different levels of gain (or cost)?
Community
How do we ensure participation/representation in the product definition process from across the global scholarly publishing community?
Libraries and other stakeholder roles operate very differently in different countries, so specificity is needed.
What are the prioritized use cases and specific problems across that global community that COMET is advancing or should focus on?
Use Cases
We had some excellent suggestions related to surfacing use cases in the listening sessions, which we have included below.
Funding information
Add funder ROR IDs or other funding PIDs
Affiliations
Correct affiliations
Add ROR IDs to affiliations
Add references, citations, and related works
Add usage counts other than citations
Add author PIDs
ORCID or other name identifier PIDs
National-level name identifiers
Add role per CRediT taxonomy
Add references (via PIDs) to research data, samples, instruments, software …
Add disciplinary information to the object record
Add object resource type, or correct the existing resource type when it is too broad or suspected to be incorrect
Add or disambiguate license information
Add language information
Metadata schema development or the addition of new/supplementary fields
Participant Questions and Comments
Community, diversity, and equity
Questions were raised about what constitutes adequate representation from the global community when it comes to identifying priority use cases, with a suggestion that additional community feedback may be needed before the product can be further defined. Others felt that taskforce members are familiar enough with the overarching needs and gaps in existing metadata to provide a starting point for this work. It was suggested that these concerns could be balanced by building a mapping between members of the community and their specific use cases or needs, taking into account regional specificities and differences in organizational definitions, e.g. that the role of a ‘library’ is very different in the US than it is in Chile.
The importance of identifying equitable pathways for participation in COMET activities was noted, as potential capacity to contribute varies widely among stakeholders. It was further stated that COMET should account for the needs of different types of publishers and venues (independent, scholar-led, or volunteer-run) and their corresponding infrastructure, including taking into account regional variation, with specific emphasis on publishing models in the global south.
Working with publishers
When discussing the extent of publisher involvement in the COMET model, it was noted that most publishers are under-resourced. There may be a number of reasons for gaps in metadata, both social and technical. On the technical side, publishers use many different platforms and service providers and can move between them. They therefore inherit any corresponding constraints and may not be able to make full use of metadata schemas or provide complete descriptions. Likewise, were they to receive many corrections, they may be similarly limited in incorporating them into their data.
Some participants felt that prioritizing direct pursuit of publisher participation would be both complex and time-consuming; instead, it was suggested that building a service that is easy to use and demonstrates clear value to publishers would make them more likely to participate.
Source enrichment events
It was recommended that events have different types or categories that could allow each to be processed differently. For example, enrichments that make metadata more complete, those that correct records, or those that augment them in some fashion. These categories could then help identify or streamline the specific actions or workflows that are needed for them to be processed.
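As one hedged illustration of how those categories might be encoded for downstream processing, a simple enumeration mirroring the categories suggested in the sessions could look like this:

```python
from enum import Enum

class EnrichmentCategory(Enum):
    """Illustrative categories suggested in the listening sessions."""
    COMPLETION = "completion"        # fills a gap in an otherwise valid record
    CORRECTION = "correction"        # replaces an existing, incorrect value
    AUGMENTATION = "augmentation"    # adds information beyond the original metadata
```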
Participants highlighted the importance of capturing the provenance of each correction, along with clear methods for validating that provenance, in order to inform decision making around the use of enrichments.
Additional questions raised included:
Could any metadata be enriched, or would enrichment be limited to certain fields?
Would it be possible to use multiple identifier systems for the same fields? For example, could ROR IDs be used alongside other organizational identifiers for an affiliation?
Should it be possible to maintain multiple, distinct assertions for the same field? For example, should different and potentially conflicting mappings between ROR IDs and affiliation strings be stored?
How should we handle specialized forms of assertions such as academic disciplines or research topics? These classifications are typically applied on top of standard/descriptive fields, but represent subjective perspectives that cannot be definitively validated against any single source of truth.
What guidelines and specifications are needed to standardize how enrichments are structured, validated, and processed across different sources and systems?
What is the role of the original depositor of DOI metadata when their records are enriched by the community, and what level of involvement should be required of them in the enrichment process?
How do we identify and differentiate between enrichments that result from human curation versus those generated by automated processes?
Ingestion Pipeline
It was suggested that the ingestion pipeline maintain records of both successful and failed enrichments. By making these results available as structured, open data - including all input data and processing steps - they can serve as trust indicators and inform future enrichment processing.
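A minimal sketch of what such an open, structured processing record might look like is shown below, written as JSON Lines; the field names and file layout are assumptions for discussion, not a proposed specification.

```python
import json
from datetime import datetime, timezone

def log_processing_result(event, status, reason, path="enrichment_log.jsonl"):
    """Append a structured record of an enrichment's processing outcome."""
    record = {
        "doi": event["doi"],
        "source": event["source"],
        "field": event["field_name"],
        "status": status,            # e.g. "accepted" or "failed_validation"
        "reason": reason,            # human-readable explanation of the outcome
        "input": event,              # full input data, preserved for transparency
        "processed_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```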
It was noted that equal priority should be given to platforms and services that consume (vs. produce) DOI metadata, as they represent key stakeholders and may be more willing to share enrichments or participate in product feedback and refinement.
Processing and Deduplication
It was noted that duplication of enrichments could itself be a valuable signal, as multiple, independent sources submitting the same enrichment could indicate higher confidence in their accuracy.
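As a simple illustration of that signal, corroboration could be measured by counting the distinct sources asserting the same value for the same field of a DOI; the counting rule here is an assumption, not a proposed metric.

```python
from collections import Counter

def corroboration_counts(events):
    """Count distinct sources asserting the same value for the same DOI field."""
    sources = {}
    for e in events:
        key = (e["doi"], e["field_name"], repr(e["value"]))
        sources.setdefault(key, set()).add(e["source"])
    return Counter({key: len(srcs) for key, srcs in sources.items()})
```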
Evaluation System
It was noted that the methodology for evaluating enrichments will need to be aligned with the processes for submitting them, so both the criteria and the schema for describing enrichments should be developed with reference to real data sources and those that could contribute.
It was commented that trust indicators will likely pose a key challenge for automation/scaling.
A question was raised about whether we expect any manual review processes to mirror those currently in use, or whether there would be significant differences, e.g. in decision making and conflict resolution.
To promote full transparency in any evaluation processes, it was suggested that all details for proposed changes should be preserved and made accessible. Given that the initial volume of changes would be small and that data storage costs are de minimis, providing them in this fashion is both practical and aligned with the model’s overall transparency goals.
Some questions arose relating to the systems for evaluating enrichments. These included:
Which organization(s) will oversee and host any curation activities?
What moderation frameworks and policies need to be implemented to guide community contributions?
How will we organize and coordinate the activities of those submitting enrichments to guarantee their quality and consistency?
Metadata Generation
Questions were raised about what should be the final destination for enriched metadata: Would enhanced records be returned to publishers for resubmission to Crossref, integrated into services like OpenAlex, or hosted on a new platform? It was noted that the technical feasibility and overall utility of the proposed COMET model may significantly depend on how and where these enriched records are delivered and how they are used. To maximize utility, there needs to be explicit scoping and definition of these downstream workflows, including from technical and architectural perspectives.
General comments and notes
It was suggested that given the complexity of this proposed model, development should begin with a focused minimum viable product that can be iteratively extended based on community feedback.
A need to establish clear categories of users (such as those who contribute metadata, provide services, or harvest data) was identified to better understand how these stakeholders will interact with and benefit from the system. This categorization will help inform the model's product requirements and value proposition for each user type.
General questions also arose regarding the cost and development of the infrastructure needed to support the effort and how these constraints might determine what use cases will be supported.