Summary Report COMET Implementation Scenario Listening Sessions | February 18, 2025

By Adam Buttrick, John Chodacki, Juan Pablo Alperin, Maria Praetzellis, and Clare Dean

DOI: 10.7269/C1CC7B

  • Implementation Scenario Outline by COMET convener Adam Buttrick

  • Comments aggregated from COMET participant read-ahead feedback and listening sessions on February 18, 2025

Implementation Scenario Outline

The following was provided as a read-ahead to the listening sessions conducted on February 18, 2025. Comments on the document are combined with those from the listening sessions and summarized in the participant comments section.

The following scenario outlines how a metadata enrichment service could process and manage enrichments from external providers (i.e., those other than the owners of DOI records) through structured workflows and community-driven governance. This scenario is one of many possible approaches and is provided to help us consider how the proposed COMET model could work in practice. It focuses specifically on how the model could evaluate, handle, and validate third-party enrichments from an initial set of providers, rather than prescribing how the enrichments themselves should be generated or the full scope of who should participate in this work.

Taskforce discussions have helped shape this scenario's core elements, which include:

  • The service is structured as a joint initiative between a membership organization, a university, and a scholarly communications nonprofit.  

  • Adherence to the Principles of Open Scholarly Infrastructure (POSI) to ensure the service operates with openness and transparency and pursues long-term sustainability.

  • Governance through a supervisory board drawn from the partner organizations, alongside a global advisory group, which together guide its priorities and ensure broad community participation.

  • An initial focus on addressing gaps in affiliation metadata and the use of institutional identifiers (e.g., ROR IDs) in DOI records.

Implementation of a Service

A new service could implement the proposed COMET model through a joint initiative between three community partners (e.g., a membership organization, a university, and a scholarly communications nonprofit) that consults an advisory group. The initiative would solicit metadata from outside sources to enrich DOI records, benefiting from the combined reach and networks of all three organizations. The partners would ideally already be committed to the Principles of Open Scholarly Infrastructure (POSI), and the shared agreement establishing the project should also adhere to POSI. Following these principles, all code and data for the project would be open source and available under permissive licenses (e.g., MIT, CC0).

Setting up governance

From there, the project builds upon the existing governance structures of its parent organizations, establishing a supervisory body composed of representatives from each organization. It also sets up a separate, community-focused steering group. Using an open nomination process, this group draws from a global pool of experts and stakeholders, who inform the service's strategic direction. The group also serves a capacity-building function, proactively seeking opportunities to support regions and communities that are underserved in the current ecosystem.

The road to a minimum viable product (MVP)

At the outset, the project leverages existing in-kind contributions (product, outreach, and technical) from its parent organizations to conduct a landscape analysis of current approaches to DOI metadata enrichment and the technical feasibility of integrating these sources. This review identifies that pursuing enrichments through a "fields as features" framework - where incremental changes are made to specific metadata fields instead of full record updates - is best suited to an expedient minimum viable product (MVP) rollout.

As its initial focus, the service could decide to target affiliations, based on several strategic considerations. Affiliations represent a priority metadata field whose quality and completeness enable tracking of institutional research outputs, collaboration patterns, and funding impacts. In current DOI records from both Crossref and DataCite, affiliation metadata suffers from widespread gaps and inaccuracies, particularly in the assignment of persistent identifiers, such as ROR IDs. The field also benefits from extensive and long-standing use of diverse extraction methods on full text sources, including the corresponding production of gold standard test datasets to assess their performance. This makes both comparative and consensus forms of analysis of these methods possible and provides a practical starting point for large-scale, automated enrichment. The same is true for ROR ID matching processes, where the performance of multiple ROR ID matching methods is known, and whose use in supplementing each new or corrected affiliation string multiplies the enrichment's value. These factors, combined with the relatively bounded scope of affiliation data compared to other metadata fields, make it an ideal starting point for the service's development.

After using affiliations to create an MVP, continued, iterative development would then aim to encompass and enrich additional fields. The same analysis also identifies several operational, technical, and resourcing needs, stemming from the diversity of potential external data sources and the need to address equity and research integrity considerations.

Involving the community in curation

To establish policies and procedures and to assist in the overall operation of the service, and with counsel from the advisory and community groups, the project puts forth a global, open call for participation in a volunteer curation group through the partners' existing membership networks and communication channels. This call seeks volunteers - ranging from librarians and repository managers to staff at publishers, government bodies, and funding organizations - who have a vested interest in the quality and completeness of DOI metadata and who each bring their own specialized regional or domain expertise to the curation process. An initial group of volunteers is identified. With the community advisory group, they participate in a series of policy development meetings, establishing baseline rules for the curation of submitted data, which are then formalized in public-facing documentation. Subsequently, new curators can be onboarded through a minimal but standardized training process that makes use of this documentation and describes the various curation tasks. As the project develops, these policies are continuously refined by the project's curation volunteers, relative to user feedback and operational needs exposed through the real-world processing of submitted data.

Beyond the MVP

Although the parent organizations can repurpose some of their resources for the project's initial development, scaling it beyond an MVP state will require additional investment. The project will thus apply for external funding to resource its long-term data science work, outreach activities, and curation needs where the parent organizations cannot provide these resources on a recurring basis.

Receiving Data Enrichments from External Sources

Multiple independent services regularly process large volumes of PDFs to extract affiliation metadata, with each service using its own unique parsing methods and workflows. Each service expresses an intent to contribute enrichments to the project. To begin doing so, they undergo a structured onboarding process. 

The first step involves establishing their organizational identity within the service. Each provider is assigned a profile entry that captures their basic details, described using persistent identifiers (typically a ROR ID for their organization, ORCIDs for any responsible staff) (Figure 1). This profile is then referenced in all submissions.
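By way of illustration, such a profile entry might look like the following sketch. The field names and identifier values here are hypothetical, since the scenario does not specify a schema:

```json
{
  "provider_id": "provider-0001",
  "organization_name": "Example Affiliation Extraction Service",
  "ror_id": "https://ror.org/00example",
  "responsible_staff": [
    {
      "name": "Jane Example",
      "orcid": "https://orcid.org/0000-0000-0000-0001"
    }
  ],
  "status": "active",
  "registered": "2025-02-18T00:00:00Z"
}
```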

Next, for any enrichments derived from automated processes, providers must also open source, document, and archive their methodology, depositing their code, models, and benchmarking data in a repository with a corresponding DOI registration. These details are captured in an enrichment service record that is associated with the profile (Figure 2). The enrichment service record ID is used in all submissions.
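The corresponding service record could then reference these archived artifacts, for example as in the sketch below. Again, the field names and DOIs are hypothetical:

```json
{
  "service_id": "service-0001",
  "provider_id": "provider-0001",
  "description": "Affiliation extraction from open access PDFs",
  "code_repository_doi": "10.0000/example.code.v1",
  "model_doi": "10.0000/example.model.v1",
  "benchmark_data_doi": "10.0000/example.benchmark.v1",
  "reported_performance": {
    "precision": 0.95,
    "recall": 0.91,
    "f1": 0.93
  }
}
```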

Finally, the performance of the enrichment method itself is verified against the provided benchmark data and its independent performance on gold-standard datasets for the task. If the enrichment method fails to meet the minimum performance standards, feedback is provided to the external service. If it meets or exceeds the requirements, the provider is notified of approval and permitted to contribute data to the service. The benchmark datasets used to produce the extraction method are then also reviewed for inclusion in the gold standard sets for the field-specific enrichment task.

 

Figure 1. Example Provider Record


Figure 2. Example Service Record

 

Creating a Standardized Enrichment Format

Before submitting data, external services transform local enrichments into a shared JSON-based format (Figure 3). Each submission includes a unique submission ID and timestamp, followed by a series of operations describing changes to a work's affiliation metadata. Each operation includes the provider ID, service ID, and work DOI, along with a specific operation type (“add”, “update”, or “delete”). Instead of modifying a field by its position, each operation specifies a target object that identifies the affected creator or contributor associated with the affiliation, using either their ORCID ID or a computed hash of the normalized name. For “update” and “delete” operations, a match object is provided to pinpoint the exact affiliation to modify using another content-based hash, while the new affiliation data is supplied in a data block for “add” actions. Optionally, operations may also include a confidence score to support conflict resolution.
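A submission following this description might be shaped as in the sketch below. The property names are illustrative only, as the scenario does not define the exact schema:

```json
{
  "submission_id": "sub-0001",
  "timestamp": "2025-02-18T12:00:00Z",
  "operations": [
    {
      "provider_id": "provider-0001",
      "service_id": "service-0001",
      "work_id": "10.0000/example.work",
      "operation": "update",
      "field": "affiliation",
      "target": {
        "orcid": "https://orcid.org/0000-0000-0000-0001"
      },
      "match": {
        "affiliation_hash": "sha256:ab12..."
      },
      "data": {
        "affiliation": "Example University, Example City",
        "ror_id": "https://ror.org/00example"
      },
      "confidence": 0.92
    }
  ]
}
```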


Original:


Revised:

 

Figure 3. Example Submission Record

Ingestion Pipeline

On receipt of enrichment data, the service runs a set of source-agnostic validation checks. These review for properly structured JSON, verifying that each submission contains a valid submission ID, timestamp, and operations array. For each operation, the service validates the presence and format of required fields, including provider_id, service_id, work_id, operation type (add/update/delete), field type (affiliation/ror_id), value, confidence score, and hashes. Operations with missing or malformed fields are rejected (Figure 4).

Validated submissions then undergo state validation relative to existing records. An enrichment request to add new data is rejected if the specified value already exists for the creator or contributor of the work. Enrichments to update or delete are rejected if the target value identified by the match object does not exist in the record.

All submissions and their ingest outcomes are tracked in a log system that is openly available via an API, allowing external services to see precisely why specific operations were rejected (Figure 5).
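An ingest log entry for a submission might then read as follows; this is a sketch with hypothetical field names and rejection reasons:

```json
{
  "submission_id": "sub-0001",
  "received": "2025-02-18T12:00:05Z",
  "results": [
    {
      "operation_index": 0,
      "status": "accepted"
    },
    {
      "operation_index": 1,
      "status": "rejected",
      "reason": "state validation failed: affiliation value already present for the specified contributor"
    }
  ]
}
```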

Figure 4. Ingest data flow

 

Figure 5. Example ingest log

 

Processing, Conflict Detection, and Mergers

On a specified schedule, the service will group validated operations in a set of submissions by their DOI. If two or more of the external services submit the same or similar data (for example, affiliation strings with minor textual variations), the system detects these duplicates, employing normalization logic as necessary. Duplicate enrichments from independent sources are captured and become a signal of correctness. Both duplicated enrichments and enrichments without conflicts are given an approved status.

Where conflicting data emerges (e.g., two submissions assign distinct affiliation strings to the same author in a paper), the service uses heuristics like edit distance measures to assess the true degree of difference between the assertions, qualifying them relative to provided confidence scores and known service performance. True conflicts are categorized for secondary evaluation, whereas minor differences can be joined into a merge state, where the assertion from the highest quality source is treated as the canonical form. For true conflicts, or where sources are of equivalent quality, the system triggers an escalation by applying a review status to the operations, grouped by the DOI for the work.
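A processing log capturing these outcomes might resemble the following sketch; the operation identifiers, detection labels, and statuses are hypothetical:

```json
{
  "work_id": "10.0000/example.work",
  "processed": "2025-02-19T00:00:00Z",
  "groups": [
    {
      "operations": ["op-0001", "op-0002"],
      "detection": "duplicate",
      "status": "approved"
    },
    {
      "operations": ["op-0003", "op-0004"],
      "detection": "minor_variation",
      "status": "merged",
      "canonical_operation": "op-0003"
    },
    {
      "operations": ["op-0005", "op-0006"],
      "detection": "true_conflict",
      "status": "review"
    }
  ]
}
```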

Figure 6. Processing data flow with conflict detection and mergers

Figure 7. Example processing log

Ensuring Successful Conflict Resolution

When the automated processing identifies true conflicts between affiliation assertions that cannot be resolved by other means, the service would route these cases to a network of qualified curators for review. Each curator has undergone a standardized onboarding and training process to ensure consistent review practices. The conflicts are presented in a structured interface, to which the curator is authenticated, such that all curation activity is associated with their profile. This interface displays the competing assertions, their sources, and any other contextual information, such as confidence scores for the methods used to derive them.

During review, curators follow a standardized decision tree to evaluate conflicts. As affiliation extractions are limited to works published in open access sources, reviewers can access the full PDF, allowing them to verify any assertions against the source material. For affiliation string conflicts, they first verify whether the differences are substantive or merely minor discrepancies. When evaluating conflicting ROR ID assignments, reviewers check for any ambiguity in the affiliation, such as the inclusion of multiple organizations or an organization name lacking location qualifiers, causing it to correspond to multiple organizations. The curator attempts to resolve these discrepancies using context available in the work, for example, by researching the author associated with the affiliation in canonical sources, such as ORCID profiles that include trust markers or staff pages for matching candidate organizations. If the curator can resolve the conflict, they push the resolution to the system through the structured interface. If the initial review is unclear, the curator can escalate the review to a senior curator, who makes the final determination about the conflict's resolvability and pushes the change to the system using the structured interface.

All review decisions are logged in the system with curator identification and timestamps. These review logs are made publicly available, ensuring full transparency in the review process. As the queue grows, the partners may need to recruit a Curation Coordinator to help manage this process. For an additional example of such a curation system, refer to the one used by ROR.
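A logged review decision might look like the sketch below; all field names and values are hypothetical:

```json
{
  "review_id": "rev-0001",
  "work_id": "10.0000/example.work",
  "operations": ["op-0005", "op-0006"],
  "curator_orcid": "https://orcid.org/0000-0000-0000-0002",
  "decision": "resolved",
  "accepted_operation": "op-0005",
  "rationale": "Affiliation verified against the organization's staff page",
  "escalated": false,
  "timestamp": "2025-02-20T09:30:00Z"
}
```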

Metadata Generation and Dissemination

When enrichments pass the automated checks or are manually approved through curatorial review, the service generates an annotation record for the DOI (Figure 8). This annotation record allows for the serving of distinct, non-destructive views of the metadata: a "base" view containing only the original deposit, and an "enriched" view that includes the approved changes. Through submission and operational references in the annotation record, each enrichment is linked back to its full provenance information, including the provider and source of the enrichment, any corresponding timestamps, and the full approval chain.
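An annotation record of this kind might be structured as follows; this is a sketch, as the scenario does not fix a schema:

```json
{
  "annotation_id": "ann-0001",
  "work_id": "10.0000/example.work",
  "enrichments": [
    {
      "field": "affiliation",
      "operation_id": "op-0001",
      "submission_id": "sub-0001",
      "provider_id": "provider-0001",
      "service_id": "service-0001",
      "status": "approved",
      "approved": "2025-02-19T00:00:00Z"
    }
  ]
}
```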

To support transparency and enable use of enrichments, the host organization exposes both views through its public API, providing clear documentation to end users explaining the differences between them. The enriched view adds an "enrichments" field to the API response, which provides an audit trail for each enriched field in the record, mapping each modified field to its source operations, submission history, and validation status (Figure 9). This view is accessible using API parameters, giving API users flexibility in how they consume the enriched data while avoiding disruption to existing API usage.
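For example, an enriched view might extend a record response along the following lines. This is a rough sketch loosely modeled on a DataCite-style record; the "enrichments" block and its keys are hypothetical:

```json
{
  "doi": "10.0000/example.work",
  "creators": [
    {
      "name": "Example, Jane",
      "affiliation": ["Example University, Example City"],
      "affiliationIdentifier": "https://ror.org/00example"
    }
  ],
  "enrichments": [
    {
      "field": "creators[0].affiliation",
      "operation_id": "op-0001",
      "submission_id": "sub-0001",
      "validation_status": "approved"
    }
  ]
}
```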

The original depositor maintains full control over their metadata and can merge or request the removal of individual enrichments at any time. Similarly, if community feedback indicates errors with specific enrichments, they can be flagged for review or removed without affecting the underlying deposit or requiring depositor action.

 

Figure 8. Annotation record example

 

Original:


Revised:

Figure 9. Example of an enriched DataCite record exposed through a discrete view, incorporating enrichments from the annotation record

 

Figure 10. Full entity relationship diagram for the service

 

The provider, source, validation, and annotation records, in addition to the fully enriched record set, are provided through publication of a CC0, versioned data file. The most recent version of this data file is served via a corresponding API, whose structure reflects the entities and data flow outlined in the preceding sections and the entity relationship diagram. Further integration with DOI registration agencies can be optimized based on community needs.

Performance Monitoring

At a set interval, or in response to user reports indicating a new pattern of errors, providers and the performance of their enrichment services are re-evaluated. If patterns exist that indicate systematic inaccuracy (e.g., consistently incorrect assertion of index values for affiliations or affiliation-ROR ID assignments), direct engagement with the provider is pursued to resolve the issue. Using a random sample of submissions made since the last assessment, a new performance analysis is conducted, comparing this performance against the provider's historic benchmarks. If systematic degradation is detected, all of the provider's enrichments are removed from records, and the provider is notified with supporting evidence. The enrichments are only restored after validation confirms their accuracy. If errors persist after remediation attempts, the provider's status is changed to "review," new submissions from the provider are suspended, and they must re-qualify through new benchmark testing.
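A re-evaluation outcome might be recorded along these lines; the fields and values in this sketch are hypothetical:

```json
{
  "provider_id": "provider-0001",
  "assessment_window": ["2025-03-01", "2025-06-01"],
  "sample_size": 500,
  "measured_performance": { "precision": 0.81, "recall": 0.79 },
  "historic_benchmark": { "precision": 0.95, "recall": 0.91 },
  "finding": "systematic_degradation",
  "actions": ["enrichments_removed", "provider_notified"],
  "provider_status": "review"
}
```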

Conclusion

This scenario presents one possible way a COMET-inspired model could be implemented. The reality of how such a service would take shape will depend on real-world constraints, community needs, and the evolving landscape of metadata enrichment. There are many ways to approach governance structures, curation processes, and technical workflows, with the correct form emerging through iteration and collaboration.


Participant Comments

Governance, authority & community

  • Participants considered that, relative to the outlined scenario, a governing advisory body should be charged with overseeing a central place for enrichments, as well as how data is managed, controlled, and acted upon within that context.

  • Questions about who would be authorized to make assertions were raised by multiple participants. Noting specifically that there may be power dynamics involved in the making of assertions and counter-assertions, participants mentioned that there isn't one way to resolve every conflicting assertion relative to a naive conception of correctness. One participant shared that for work done within their organization, 95% of curation tasks are clear and unarguably related to basic correctness, anticipating that the bulk of curation activities for the proposed COMET model would be similarly resolvable, but noted that, even still, a number of edge cases would continue to be challenging to resolve.

  • It was emphasized that openness is critical for implementation of the proposed COMET model, as this quality is what allows for building community, facilitating action, and reducing costs. Open, shared data sources and infrastructure help institutions worldwide reduce toil and money spent on duplicative work. It was suggested that the proposed COMET model could both unite different communities and improve efficiencies in this respect.

Trust & sources of truth

  • There were thoughtful exchanges regarding which sources, individuals, and organizations could be entrusted with making improvements, with considerations about trust in evolving geopolitical contexts. The group noted that trustworthiness may change over time, necessitating attention to power dynamics between stakeholders and their motivations when developing resolution mechanisms. These mechanisms should include processes for validating enhancements and addressing conflicts. In the context of this discussion, participants also expressed the importance of ensuring there is global representation within the proposed COMET model's governance.

  • The group considered how to gauge and measure the trustworthiness of enrichment sources. Participants noted that it is equally important for any evaluation criteria used to assess enrichment data to also be openly available, such that there is a good baseline understanding across the ecosystem of the sources from which trust indicators are derived.

  • Questions were raised about additional opportunities for embedding source information within enhancement requests to increase trust and validate origins.

  • Several participant discussions explored the concept of authority and competing sources of truth. One of these exchanges examined the nature of correctness itself—specifically, whether the proposed COMET model should pursue a singular correct answer, or if the curation process should instead accommodate multiple perspectives, allowing different stakeholders to make individual determinations about what they consider to be most accurate or useful.

  • Participants noted that there is pressure in the scholarly communications community for there to be a single, authoritative source of truth, with many currently relying on commercial services, such as Scopus, Web of Science, or Dimensions, to fulfill this role. They noted that as the number of players and contributors to the metadata enrichment landscape increases, implementing changes across all systems becomes more challenging and exists in tension with this pressure. The group emphasized that enrichment data would need to be openly available and beneficial to all systems to gain traction.

  • Participants explored implementation of the proposed COMET model as an intermediary assertion store that would exist independently from DOI registration agencies. This approach would allow for multiple assertions to coexist, while capturing clear provenance information, without the need for immediate quality assessments for integration into primary records. Some participants, however, questioned whether this approach would achieve widespread adoption without integration paths back to registration agencies and what would be required to incentivize this course of action or contribution of enrichments in its absence.

  • Some participants noted that PID metadata evolves over time, explaining that assertions can thus degrade relative to these changes, challenging the concept of DOI metadata as definitive or stable reference points. Relative to these considerations, participants raised an important question: what happens when a referenced field becomes inaccessible? Should enrichments be reprocessed or additional assertions be made? They emphasized the need to consider these dynamic, long-term interacting systems and address such considerations within both the technical infrastructure and any governance frameworks.

Technical and process

  • Participants considered circumstances where the baseline completeness of DOI metadata may be insufficient to support enrichment, or where it is complete but too ambiguous, such as authors listing only an organization name that is used by multiple organizations, without geographic or other contextual information. In this context, it was noted that efforts to improve DOI records relative to an ambiguous baseline sometimes fundamentally transform, rather than enrich, them.

  • Related to receiving data enrichments from external sources, a participant suggested the need for clear, public-facing guidelines about the level of entity extraction required to support enrichment contributions. It was suggested that this approach could help determine cost-effective solutions for supporting the proposed COMET model, such as smaller local journals benefiting from pre-publication services that guarantee the necessary level of metadata is present, while negotiations with larger publishers could include requirements that accurate and complete metadata be provisioned.

  • Participants discussed the role and importance of manual curation alongside automated enrichment processes. A concern was raised that the scenario seems to privilege automated processes over manually provided corrections, such as when a research organization might assert that they know data is wrong or absent from a given record, but without reference data for comparison. In response to this, it was suggested that manual curation and automated methods could be mutually reinforcing rather than opposing approaches. For example, manually curated enrichments could be automatically checked against extraction methods as an additional indicator of confidence.

  • Participants emphasized that institutional expertise should be factored into how confidence in enrichments is assessed. Within this context, they discussed the challenges of handling cases where an institution claims authorship affiliation that isn't present in the original record, suggesting that details about the source processes could inform confidence scores and play a role in conflict resolution. As a concrete example, a participant noted that their institution performs careful manual curation of DOI metadata, including affiliations, and expressed hope that such institutionally-curated data would take precedence in cases of conflict. Others pointed out the need for transparency about these institutional curation practices, suggesting that organizations would need to be assigned confidence levels based on the perceived quality of these processes within a shared assessment framework.

  • Participants discussed DOI registration agencies' willingness to participate in enrichment. The implementation scenario describes a non-destructive approach where a registration agency service could create a layer of enrichments that exist alongside the original record through an API, without modifying the record itself. This raised questions about whether the registration agencies would find this valuable and which might be willing to implement such a system.

  • Discussion was also had about the data modeling approach included in the scenario. Concerns were expressed about how updates would target specific records, with participants debating whether indexes, ORCID IDs, or name hashes would be reliable points of reference as identifiers, given that these could change in the original metadata. This highlighted broader questions about how point-in-time enrichments should relate to subsequent updates, with several potential approaches suggested: preserving enrichments with staleness flags, reprocessing enrichments when reference points change, or implementing version-linked enrichments tied to specific record versions.

  • A participant raised a question about the complexities around consumption of enriched data by parties other than registration agencies. Third parties incorporating enrichment metadata might begin with a base layer different from the DOI record, either starting from or creating additional layers of undocumented assertions as part of this incorporation. It was then considered how this more complex set of transformations, where records undergo multiple changes beyond simple consumption of DOI records and enrichments, could (or should) be fed back to and related to the original records and any associated enrichments.

  • Participants expressed concern about potentially fraudulent enrichments, such as the addition of erroneous affiliation metadata to boost institutional rankings. They noted that while the scenario proposes more monitoring than currently exists in DOI metadata systems, additional safeguards would be valuable. Suggestions included implementing mechanisms for immediate intervention when incorrect data is detected, making enrichments publicly available under CC0 licenses to enable community auditing, distributing verification responsibilities beyond the central service, and incorporating strict terms of service during provider onboarding. Some participants recommended drawing lessons from how platforms like Wikipedia handle similar challenges. The group acknowledged tension between centralized verification and distributed responsibility, suggesting the architecture should support eventual community monitoring even if complete verification exceeds MVP scope.

Miscellaneous

  • It was noted that bibliometric research often faces challenges being acted upon due to inconsistent frames of reference, such as the use of custom or narrowly scoped evaluation datasets for problems relating to the quality and completeness of DOI metadata. A valuable feature of the proposed COMET model could therefore be a community-derived reference framework that allows researchers to assess their work against a shared, standardized frame of metadata improvements, enabling more actionable and comparative insights.
