Fatcat Guide

Data Model

Entity Types and Ontology

Loosely following "Functional Requirements for Bibliographic Records" (FRBR), but removing the "manifestation" abstraction, and favoring files (digital artifacts) over physical items, the primary bibliographic entity types are:

Note that, compared to many similar bibliographic ontologies, the current one does not have entities to represent:

Each entity type has its own relations and fields (captured in a schema), but there are also generic operations and fields common across all entities.

Common Entity Fields

All entities have:

The "extra" field is an "escape hatch" to include extra fields not in the regular schema. It is intended to enable gradual evolution of the schema, as well as accommodating niche or field-specific content. Reasonable care should be taken with this extra metadata: don't include large text or binary fields, hundreds of fields, duplicate metadata, etc.

All full entities (distinct from revisions) also have the following fields:

state Vocabulary

Identifiers and Revisions

A specific version of any entity in the catalog is called a "revision". Revisions are generally immutable (do not change and are not editable), and are not normally referred to directly. Instead, persistent "fatcat identifiers" (ident) can be created, which "point to" a single revision at a time. This distinction means that entities referred to by an identifier can change over time (as metadata is corrected and expanded). Revision objects do not "point" back to specific identifiers, so they are not the same as a simple "version number" for an identifier.

Identifiers also have the ability to be merged (by redirecting one identifier to another) and "deleted" (by pointing the identifier to no revision at all). All changes to identifiers are captured as an "edit" object. Edit history can be fetched and inspected on a per-identifier basis.

Controlled Vocabularies

Some individual fields have additional constraints, either in the form of pattern validation ("values must be upper case, contain only certain characters"), or membership in a fixed set of values. These may include:

Other fixed-set "vocabularies" become too large to easily maintain or express in code. These could be added to the backend databases, or be enforced by bots (instead of the system itself). These mostly include externally-registered identifiers or types, such as:

Global Edit Changelog

As part of the process of "accepting" an edit group, a row is written to an immutable, append-only table (which internally is a SQL table) documenting each identifier change. This changelog establishes a monotonically increasing version number for the entire corpus, and should make interaction with other systems easier (eg, search engines, replicated databases, alternative storage backends, notification frameworks, etc.).
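As a sketch of how a downstream system might consume this changelog, the following polls entries in index order. The base URL and endpoint path are assumptions and should be checked against the API documentation:

    import time
    import requests

    API = "https://api.fatcat.wiki/v0"  # assumed base URL; check the API docs

    def follow_changelog(start_index: int):
        """Poll the append-only changelog, yielding entries in index order."""
        index = start_index
        while True:
            resp = requests.get(f"{API}/changelog/{index}")
            if resp.status_code == 404:
                time.sleep(30)  # caught up; wait for newly accepted editgroups
                continue
            resp.raise_for_status()
            yield resp.json()
            index += 1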

Container Entity Reference

Fields

extra Fields

Additional fields used in analytics and "curation" tracking:

For KBART and other "coverage" fields, we "over-count" on the assumption that works with "in-progress" status will soon actually be preserved. Elements of these arrays are either a single integer (meaning that one year is preserved) or an array of length two (meaning every year between the two numbers, inclusive, is preserved).
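A minimal sketch of expanding such a coverage array into individual years:

    def expand_year_spans(spans):
        """Expand a coverage array into the set of years it claims.

        Elements are either a single integer year, or a two-element
        [start, end] span that is inclusive on both ends.
        """
        years = set()
        for element in spans:
            if isinstance(element, int):
                years.add(element)
            else:
                start, end = element
                years.update(range(start, end + 1))
        return years

    # e.g. expand_year_spans([1998, [2001, 2003]]) -> {1998, 2001, 2002, 2003}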

container_type Vocabulary

publication_status Vocabulary

File Entity Reference

Fields

URL rel Vocabulary

content_scope Vocabulary

This same vocabulary is shared between file, fileset, and webcapture entities; not all the fields make sense for each entity type.

Creator Entity Reference

Fields

extra Fields

All are optional.

Human Names

Representing names of human beings in databases is a fraught subject. For some background reading, see:

Particularly difficult issues in the context of a bibliographic database include:

The general guidance for Fatcat is to:

The data model for the creator entity has three name fields:

Names do not necessarily need to be expressed in a Latin character set, but they also do not necessarily need to be in the native language of the creator or the language of their notable works.

Ideally all three fields are populated for all creators.

It seems likely that this schema and guidance will need review.

Fileset Entity Reference

Fields

URL rel types

Any ending in "-base" implies that a file path (from the manifest) can be appended to the "base" URL to get a file download URL. Any "bundle" implies a direct link to an archive or "bundle" (like .zip or .tar) which contains all the files in this fileset

Web Capture Entity Reference

Fields

Warning: This schema is not yet stable.

Release Entity Reference

Fields

External Identifiers (ext_ids)

The ext_ids object name-spaces external identifiers and makes it easier to add new identifiers to the schema in the future.

Many identifier fields must match an internal regex (string syntax constraint) to ensure they are properly formatted, though these checks aren't always complete or correct in more obscure cases.
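As an illustration of this kind of syntax check, the patterns below are simplified approximations, not the catalog's actual internal regexes:

    import re

    # Simplified, illustrative patterns only; the real checks differ in
    # detail and deliberately stay permissive for obscure-but-valid cases.
    PATTERNS = {
        "doi": re.compile(r"^10\.\d{3,6}/\S+$"),         # registrant prefix "/" suffix
        "pmid": re.compile(r"^\d{1,8}$"),                # PubMed IDs are numeric
        "arxiv": re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$"), # post-2007 arXiv style
    }

    def looks_valid(scheme: str, value: str) -> bool:
        pattern = PATTERNS.get(scheme)
        return bool(pattern and pattern.match(value))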

extra Fields

release_type Vocabulary

This vocabulary is based on the CSL types, with a small number of (proposed) extensions:

An example of a stub might be a paper that gets an extra DOI by accident; the primary DOI should be a full release, and the accidental DOI can be a stub release under the same work. stub releases shouldn't be considered full releases when counting or aggregating (though this may not always be implemented if technically difficult). Other things that can be categorized as stubs (and which often end up mis-categorized as full articles in bibliographic databases):

All other CSL types are also allowed, though they are mostly out of scope:

For the purpose of statistics, the following release types are considered "papers":

release_stage Vocabulary

These roughly follow the DRIVER publication version guidelines, with the addition of a retracted status.

Note that in the case of a retraction, the original publication does not get the retracted stage; only the retraction notice does. The original publication does get a withdrawn_status metadata field set.

When blank, indicates the stage isn't known and wasn't inferred at creation time. It can often be interpreted as published, but be careful!

withdrawn_status Vocabulary

We don't know of an existing controlled vocabulary for things like retractions or other reasons for marking papers as removed from publication, so we invented our own. These labels should be considered experimental and subject to change.

Note that some of these will apply more to pre-print servers or publishing accidents, and don't necessarily make sense as a formal change of status for a print journal publication.

Any value at all indicates that the release should be considered "no longer published by the publisher or primary host", which could mean different things in different contexts. As some concrete examples, works are often accidentally assigned a duplicate DOI; physics papers have been taken down in response to government order under national security justifications; papers have been withdrawn for public health reasons (above and beyond any academic-style retraction); entire journals may be found to be predatory and pulled from circulation; individual papers may be retracted by authors if a serious mistake or error is found; an author's entire publication history may be retracted in cases of serious academic misconduct or fraud.

contribs.role Vocabulary

All other CSL role types are also allowed, though are mostly out of scope for Fatcat:

If blank, indicates that the type of contribution is not known; this can often be interpreted as authorship.

More About DOIs

All DOIs stored in an entity column should be registered (aka, should be resolvable from doi.org). Invalid identifiers may be cleaned up or removed by bots.

DOIs should always be stored and transferred in lower-case form. Note that there are almost no other constraints on DOIs (and handles in general): they may contain multiple forward slashes or whitespace, may be of arbitrary length, etc. Crossref has a number of examples of such "valid" but frustratingly formatted strings.
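A minimal normalization sketch reflecting the lower-case convention; it assumes a few common URL and "doi:" prefix forms and makes no attempt to validate the loosely-constrained suffix:

    def normalize_doi(raw: str) -> str:
        """Lower-case a DOI and strip common URL and "doi:" prefix forms."""
        doi = raw.strip()
        for prefix in ("https://doi.org/", "http://doi.org/", "http://dx.doi.org/", "doi:"):
            if doi.lower().startswith(prefix):
                doi = doi[len(prefix):]
                break
        return doi.lower()

    # e.g. normalize_doi("https://doi.org/10.1000/ABC.123") -> "10.1000/abc.123"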

In the Fatcat ontology, DOIs and release entities are one-to-one.

It is the intention to automatically (via bot) create a Fatcat release for every Crossref-registered DOI from an allowlist of media types ("journal-article" etc, but not all), and it would be desirable to auto-create entities for in-scope publications from all registrars. It is not the intention to auto-create a release for every registered DOI. In particular, "sub-component" DOIs (eg, for an individual figure or table from a publication) aren't currently auto-created, but could be stored in "extra" metadata, or on a case-by-case basis.

Work Entity Reference

Works have no fields! They just group releases.


REST API

The Fatcat HTTP API is a read-only API for querying and searching the catalog.

A declarative specification of all API endpoints, JSON data models, and response types is available in OpenAPI 3.1 format. Auto-generated reference documentation is, for now, available at https://scholar.archive.org/docs.

All API traffic is over HTTPS. All endpoints accept and return only JSON serialized content.
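For example, fetching a single release entity might look like the sketch below; the base URL and the expand query parameter are assumptions, so consult the reference documentation for authoritative details:

    import requests

    API = "https://api.fatcat.wiki/v0"  # assumed base URL; see the reference docs

    def get_release(ident: str, expand: str | None = None) -> dict:
        """Fetch one release entity as JSON, optionally expanding sub-entities."""
        params = {"expand": expand} if expand else {}
        resp = requests.get(f"{API}/release/{ident}", params=params)
        resp.raise_for_status()
        return resp.json()

    # e.g. get_release("<release ident>", expand="container,files")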


Bulk Exports

There are several types of bulk exports and database dumps folks might be interested in:

All exports and dumps get uploaded to the Internet Archive under the "Fatcat Database Snapshots and Bulk Metadata Exports" collection.

Complete Database Dumps

The simplest and most complete bulk export. Useful for disaster recovery, mirroring, or forking the entire service. The internal database schema is not stable, so these dumps are less useful for longitudinal analysis. These dumps will include edits-in-progress, deleted entities, old revisions, etc, which are potentially difficult or impossible to fetch through the API.

Public copies may have some tables redacted (eg, API credentials).

Dumps are in PostgreSQL pg_dump "tar" binary format, and can be restored locally with the pg_restore command. See ./extra/sql_dumps/ for commands and details. Dumps are on the order of 100 GBytes (compressed) and will grow over time.

Changelog History

These are currently unimplemented; they would involve "hydrating" sub-entities into changelog exports. Useful for some mirrors, and for analysis that needs to track provenance information. The format would be the public API schema (JSON).

All information in these dumps should be possible to fetch via the public API, including on a feed/streaming basis using the sequential changelog index. All information is also contained in the database dumps.

Identifier Snapshots

Many of the other dump formats are very large. To save time and bandwidth, a few simple snapshot tables can be exported directly in TSV format. Because these tables can be dumped in single SQL transactions, they are consistent point-in-time snapshots.

One format is per-entity identifier/revision tables. These contain active, deleted, and redirected identifiers, with revision and redirect references, and are used to generate the entity dumps below.

Other tables contain external identifier mappings or file hashes.

Release abstracts can be dumped in their own table (JSON format), allowing them to be included only by reference from other dumps. The copyright status and usage restrictions on abstracts are different from other catalog content; see the metadata licensing section for more context. Abstracts are immutable and referenced by hash in the database, so the consistency of these dumps is not as much of a concern as with other exports.

Unlike all other dumps and public formats, the Fatcat identifiers in these dumps are in raw UUID format (not base32-encoded), though this may be fixed in the future.

See ./extra/sql_dumps/ for scripts and details. Dumps are on the order of a couple GBytes each (compressed).

Entity Exports

Using the above identifier snapshots, the Rust fatcat-export program outputs single-entity-per-line JSON files with the same schema as the HTTP API. These might contain the default fields, or be in "expanded" format containing sub-entities for each record.

Only "active" entities are included (not deleted, work-in-progress, or redirected entities).

These dumps can be quite large when expanded (over 100 GBytes compressed), but do not include history so will not grow as fast as other exports over time. Not all entity types are dumped at the moment; if you would like specific dumps get in touch!
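A sketch of streaming one of these single-entity-per-line exports (the filename below is a placeholder; actual export filenames vary by entity type and date):

    import gzip
    import json

    def iter_entities(path: str):
        """Stream entities from a gzipped, one-JSON-object-per-line export."""
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                yield json.loads(line)

    # e.g. count how many releases in a dump carry a DOI
    doi_count = sum(
        1 for release in iter_entities("release_export_expanded.json.gz")
        if release.get("ext_ids", {}).get("doi")
    )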


Fatcat Identifiers

Fatcat identifiers are semantically meaningless fixed-length random numbers, usually represented in case-insensitive base32 format. Each entity type has its own identifier namespace.

128-bit (UUID size) identifiers encode as 26 characters (but note that not all such strings decode to valid UUIDs), and in the backend can be serialized in UUID columns:

work_rzga5b9cd7efgh04iljk8f3jvz
https://scholar.archive.org/fatcat/work/rzga5b9cd7efgh04iljk8f3jvz

In comparison, 96-bit identifiers would have 20 characters and look like:

work_rzga5b9cd7efgh04iljk
https://scholar.archive.org/fatcat/work/rzga5b9cd7efgh04iljk

and 64-bit:

work_rzga5b9cd7efg
https://scholar.archive.org/fatcat/work/rzga5b9cd7efg

Fatcat identifiers can be used to interlink between databases, but are explicitly not intended to supplant DOIs, ISBNs, handles, ARKs, and other "registered" persistent identifiers for general use.
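A minimal sketch of converting between the raw UUID form (as found in the SQL snapshots) and the 26-character external form, assuming standard (RFC 4648) base32, lowercased, with padding stripped:

    import base64
    import uuid

    def uuid_to_ident(raw: uuid.UUID) -> str:
        """Encode a 128-bit UUID as a 26-character, lower-case base32 ident."""
        return base64.b32encode(raw.bytes).decode("ascii").rstrip("=").lower()

    def ident_to_uuid(ident: str) -> uuid.UUID:
        """Decode a 26-character ident back to a UUID (re-adding base32 padding)."""
        return uuid.UUID(bytes=base64.b32decode(ident.upper() + "======"))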

Internal Schema

Internally, identifiers are lightweight pointers to "revisions" of an entity. Revisions are stored in their complete form, not as a patch or difference; if comparing to distributed version control systems (for managing changes to source code), this follows the git model, not the mercurial model.

The entity revisions are immutable once accepted; the editing process involves the creation of new entity revisions and, if the edit is approved, pointing the identifier to the new revision. Entities cross-reference between themselves by identifier not revision number. Identifier pointers also support (versioned) deletion and redirects (for merging entities).

Edit objects represent a change to a single entity; edits get batched together into edit groups (like "commits" and "pull requests" in git parlance).

SQL tables look something like this (with separate tables per entity type, a la work_revision and work_edit):

entity_ident
    id (uuid)
    current_revision (entity_revision foreign key)
    redirect_id (optional; points to another entity_ident)
    is_live (boolean; whether newly created entity has been accepted)

entity_revision
    revision_id
    <all entity-style-specific fields>
    extra: json blob for schema evolution

entity_edit
    timestamp
    editgroup_id (editgroup foreign key)
    ident (entity_ident foreign key)
    new_revision (entity_revision foreign key)
    new_redirect (optional; points to entity_ident table)
    previous_revision (optional; points to entity_revision)
    extra: json blob for provenance metadata

editgroup
    editor_id (editor table foreign key)
    description
    extra: json blob for provenance metadata

An individual entity can be in the following "states", from which the given actions (transition) can be made:

"WIP, redirect" or "WIP, deleted" are invalid states.

Additional entity-specific columns hold actual metadata. Additional tables (which reference both entity_revision and entity_ident foreign keys as appropriate) represent things like authorship relationships (creator/release), citations between works, etc. Every revision of an entity requires duplicating all of these associated rows, which could end up being a large source of inefficiency, but is necessary to represent the full history of an object.
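A minimal sketch of deriving an entity's state from the entity_ident columns above, consistent with the note that "WIP, redirect" and "WIP, deleted" are invalid combinations (the state names here are descriptive, not an official vocabulary):

    def entity_state(is_live: bool, redirect_id, current_revision) -> str:
        """Derive a descriptive state from the entity_ident columns."""
        if not is_live:
            return "work-in-progress"  # redirect/deleted WIP combinations are invalid
        if redirect_id is not None:
            return "redirect"
        if current_revision is None:
            return "deleted"
        return "active"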


Sources

The core metadata bootstrap sources, by entity type, are:

Initial file metadata and matches (file-to-release) come from earlier Internet Archive matching efforts, in particular efforts to extract bibliographic metadata from PDFs (using GROBID) and fuzzy match it (with conservative settings) against Crossref metadata.

The intent is to continuously ingest and merge metadata from a small number of large (~2-3 million or more records) general-purpose aggregators and catalogs in a centralized fashion, using bots, and then support volunteers and organizations in writing bots to merge high-quality metadata from field- or institution-specific catalogs.

Provenance information (where the metadata comes from, or who "makes specific claims") is stored in edit metadata in the data model. Value-level attribution can be achieved by looking at the full edit history for an entity as a series of patches.


Reference Graph (refcat)

In Summer 2021, the first version of a reference graph dataset, named "refcat", was released and integrated into the fatcat web interface. The dataset contains billions of references between papers in the fatcat catalog, as well as partial coverage of references from papers to books, to websites, and from Wikipedia articles to papers. This is a first step towards identifying links and references between scholarly works of all types preserved in archive.org.

The refcat dataset can be downloaded in JSON lines format from the archive.org "Fatcat Database Snapshots and Bulk Metadata Exports" collection, and is released under a CC-0 license for broad reuse. Acknowledgement and attribution for both the aggregated dataset and the original metadata sources is strongly encouraged (see below for provenance notes).

References can be browsed on fatcat on an "outbound" ("References") and "inbound" ("Cited By") basis for individual release entities. There are also special pages for Wikipedia articles ("outbound", such as Internet) and Open Library books ("inbound", such as The Gift). JSON versions of these pages are available, but do not yet represent a stable API.

How It Works

Raw reference data comes from multiple sources (see "provenance" below), but has the common structure of a "source" entity (which could be a paper, Wikipedia article, etc) and a list of raw references. There might be duplicate references for a single "source" work coming from different providers (eg, both Pubmed and Crossref reference lists). The goal is to match as many references as possible to the "target" work being referenced, creating a link from source to target. If a robust match is not found, the "unmatched" reference is retained and displayed in a human readable fashion if possible.

Depending on the source, raw references may be a simple "raw" string in an arbitrary citation style; may have been parsed or structured in fields like "title", "year", "volume", "issue"; might include a URL or identifier like an arxiv.org identifier; or may have already been matched to a specific target work by another party. It is also possible the reference is vague, malformed, mis-parsed, or not even a reference to a specific work (eg, "personal communication"). Based on the available structure, we might be able to do a simple identifier lookup, or may need to parse a string, or do "fuzzy" matching against various catalogs of known works. As a final step we take all original and potential matches, verify the matches, and attempt to de-duplicate references coming from different providers into a list of matched and unmatched references as output. The refcat corpus is the output of this process.

Two dominant modes of reference matching are employed: identifier-based matching and fuzzy matching. Identifier-based matching currently works with DOIs, arXiv ids, PMIDs, PMCIDs, and ISBNs. Fuzzy matching employs a scalable way to cluster documents (with pluggable clustering algorithms). For each cluster of match candidates we run a more extensive verification process, which yields a match confidence category ranging from weak through strong to exact. Strong and exact matches are included in the graph.
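A highly simplified sketch of this two-mode matching flow; the helper functions are hypothetical stand-ins for the real (much more involved) open source tooling linked below:

    # Hypothetical stand-ins for the real matching components.
    def lookup_by_identifier(ref):
        """Identifier-based matching: DOI, arXiv id, PMID, PMCID, or ISBN lookup."""
        return None  # a target work if an identifier resolves, else None

    def fuzzy_match_candidates(ref):
        """Fuzzy matching: scalable clustering producing candidate targets."""
        return []

    def verify(ref, candidate):
        """Verification step yielding a confidence category."""
        return "weak"  # one of "weak", "strong", "exact"

    def match_reference(ref):
        """Return (target, status) for one raw reference."""
        target = lookup_by_identifier(ref)
        if target is not None:
            return target, "exact"
        for candidate in fuzzy_match_candidates(ref):
            confidence = verify(ref, candidate)
            if confidence in ("strong", "exact"):
                return candidate, confidence
        # everything else is retained as an "unmatched" reference for display
        return None, "unmatched"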

All the code for this process is available open source:

Metadata Provenance

The provenance for each reference in the index is tracked and exposed via the match_provenance field. A fatcat- prefix to the field means that the reference came through the refs metadata field stored in the fatcat catalog, but originally came from the indicated source. In the absence of fatcat-, the reference was found, updated, or extracted at indexing time and is not recorded in the release entity metadata.

Specific sources:

Note that sources of reference metadata which have formal licensing restrictions, even CC-BY or ODC-BY licenses as used by several similar datasets, are not included in refcat.

Current Limitations and Known Issues

The initial Summer 2021 version of the index has a number of limitations. Feedback on features and coverage is welcome! We expect this dataset to be iterated on regularly, as there are a few dimensions along which it can be improved and extended.

The reference matching process is designed to eventually operate in both "batch" and "live" modes, but currently only "batch" output is in the index. This means that references from newly published papers are not added to the index in an ongoing fashion.

Fatcat "release" entities (eg, papers) are matched from a Spring 2021 snapshot. References to papers published after this time will not be linked.

Wikipedia citations come from the dataset Wikipedia Citations: A comprehensive dataset of citations with identifiers extracted from English Wikipedia, by Singh, West, and Colavizza. This is a one-time corpus based on a May 2020 snapshot of English Wikipedia only, and is missing many current references and citations. Additionally, only direct identifier lookups (eg, DOI matches) are used, not fuzzy metadata matching.

Open Library "target" matches are based on a snapshot of Open Library works, and are matched either ISBN (extracted from citation string) or fuzzy metadata matching.

Crossref references are extracted from a January 2021 snapshot of Crossref metadata, and do not include many updates to existing works.

Hundreds of millions of raw citation strings ("unstructured") have not been parsed into a structured form for fuzzy matching. We plan to use GROBID to parse these citation strings, in addition to the current use of GROBID parsing for references from fulltext documents.

The current GROBID parsing used version 0.6.0. Newer versions of GROBID have improved citation parsing accuracy, and we intend to re-parse all PDFs over time. Additional manually-tagged training datasets could improve GROBID performance even further.

In a future update, we intend to add Wayback (web archive) capture status and access links for references to websites (distinct from references to online journal articles or books). For example, references to an online news article or blog post would indicate the closest (in time, to the "source" publication date) Wayback captures to that web page, if available.

References are only displayed on fatcat, not yet on scholar.archive.org.

There is no current or planned mechanism for searching, sorting, or filtering article search results by (inbound) citation count. This would require resource-intensive transformations and continuous re-indexing of search indexes.

It is unclear how the batch-generated refcat dataset and API-editable release refs metadata will interact in the future. The original refs may eventually be dropped from the fatcat API, or at some point the refcat corpus may stabilize and be imported in to fatcat refs instead of being maintained as a separate dataset and index. It would be good to retain a mechanism for human corrections and overrides to the machine-generated reference graph.


Metadata Licensing

The Fatcat catalog content license is the Creative Commons Zero ("CC-0") license, which is effectively a public domain grant. This applies to the catalog metadata itself (titles, entity relationships, citation metadata, URLs, hashes, identifiers), as well as "meta-meta-data" provided by editors (edit descriptions, provenance metadata, etc).

The core catalog is designed to contain only factual information: "this work, known by this title and with these third-party identifiers, is believed to be represented by these files and published under such-and-such venue". As a norm, sourcing metadata (for attribution and provenance) is retained for each edit made to the catalog.

A notable exception to this policy is abstracts, for which no copyright claims or license are made. Abstract content is kept separate from core catalog metadata; downstream users need to make their own decisions regarding reuse and distribution of this material.

As a social norm, it is expected (and appreciated!) that downstream users of the public API and/or bulk exports provide attribution, and even transitive attribution (acknowledging the original source of metadata contributed to Fatcat). As an academic norm, researchers are encouraged to cite the corpus as a dataset (when this option becomes available). However, neither of these norms are enforced via the copyright mechanism.


For Publishers

This page addresses common questions and concerns from publishers of research works indexed in Fatcat, as well as the Internet Archive Scholar service built on top of it.

For help in exceptional cases, contact Internet Archive through our usual support channels.

Metadata Indexing

Many publishers will find that metadata records are already included in fatcat if they register persistent identifiers for their research works. This pipeline is based on our automated harvesting of DOI, Pubmed, dblp, DOAJ, and other metadata catalogs. This process can take some time (eg, days from registration), does not (yet) cover all persistent identifiers, and will only cover those works which get identifiers.

For publishers who find that they are not getting indexed in fatcat, our primary advice is to register ISSNs for venues (journals, repositories, conferences, etc), and to register DOIs for all current and back-catalog works. DOIs are the most common and integrated identifier in the scholarly ecosystem, and will result in automatic indexing in many other aggregators in addition to fatcat/scholar. There may be funding or resources available for smaller publishers to cover the cost of DOI registration, and ISSN registration is usually no-cost or affordable through national institutions.

We do not recommend that journal or conference publishers use general-purpose repositories like Zenodo to obtain no-cost DOIs for journal articles. These platforms are a great place for pre-publication versions, datasets, software, and other artifacts, but not for primary publication-version works (in our opinion).

If DOI registration is not possible, one good alternative is to get included in the Directory of Open Access Journals and deposit article metadata there. This process may take some time, but is a good basic indicator of publication quality. DOAJ article metadata is periodically harvested and indexed in fatcat, after a de-duplication process.

Improving Automatic Preservation

In alignment with its mission, Internet Archive makes basic automated attempts to capture and preserve all open access research publications on the public web, at no cost. This effort comes with no guarantees around completeness, timeliness, or support communications.

Preservation coverage can be monitored through the journal-specific dashboards or via the coverage search interface.

There are a few technical things publishers can do to increase their preservation coverage, in addition to the metadata indexing tips above:

Official Preservation

Internet Archive is developing preservation services for scholarly content on the web. Contact us at scholar@archive.org for details.

Existing web archiving services offered to universities, national libraries, and other institutions may already be appropriate for some publications. Check if your affiliated institutions already have an Archive-It account or other existing relationship with Internet Archive.

Small publishers using Open Journal System (OJS) should be aware of the PKP preservation project.


Presentations

2020 Workshop On Open Citations And Open Scholarly Metadata 2020 - Fatcat (video on archive.org)

2019-10-25 FORCE2019 - Perpetual Access Machines: Archiving Web-Published Scholarship at Scale (video on youtube.com)

Blog Posts And Press

2021-03-09: blog.archive.org - Search Scholarly Materials Preserved in the Internet Archive

2020-09-17 blog.dshr.org - Don't Say We Didn't Warn You

2020-09-15: blog.archive.org - How the Internet Archive is Ensuring Permanent Access to Open Access Journal Articles

2020-02-18 blog.dshr.org - The Scholarly Record At The Internet Archive

2019-04-18 blog.dshr.org - Personal Pods and Fatcat

2018-10-03 blog.dshr.org - Brief Talk At Internet Archive Event

2018-03-05 blog.archive.org - Andrew W. Mellon Foundation Awards Grant to the Internet Archive for Long Tail Journal Preservation

Background / Bibliography

Brainard, Jeffrey. “Dozens of Scientific Journals Have Vanished from the Internet, and No One Preserved Them.” Science | AAAS. Last modified September 8, 2020. Accessed August 6, 2021. https://www.sciencemag.org/news/2020/09/dozens-scientific-journals-have-vanished-internet-and-no-one-preserved-them.
Chen, Xiaotian. “Embargo, Tasini, and ‘Opted Out’: How Many Journal Articles Are Missing from Full-Text Databases.” Internet Reference Services Quarterly 7, no. 4 (September 2002): 23–34.
Eve, Martin Paul, and Jonathan Gray, eds. Reassembling Scholarly Communications: Histories, Infrastructures, and Global Politics of Open Access. Cambridge, Massachusetts: The MIT Press, 2020.
Ito, Joichi. “Citing Blogs.” Joi Ito’s Web (2018). Accessed March 11, 2019. https://joi.ito.com/weblog/2018/05/28/citing-blogs.html.
Karaganis, Joe, ed. Shadow Libraries: Access to Knowledge in Global Higher Education. Cambridge, MA : Ottawa, ON: The MIT Press ; International Development Research Centre, 2018.
Khabsa, Madian, and C. Lee Giles. “The Number of Scholarly Documents on the Public Web.” PLOS ONE 9, no. 5 (May 9, 2014): e93949.
Knoth, Petr, and Zdenek Zdrahal. “CORE: Three Access Levels to Underpin Open Access.” D-Lib Magazine 18, no. 11/12 (November 2012). Accessed March 11, 2019. http://www.dlib.org/dlib/november12/knoth/11knoth.html.
Kwon, Diana. “More than 100 Scientific Journals Have Disappeared from the Internet.” Nature (September 10, 2020). Accessed August 6, 2021. https://www.nature.com/articles/d41586-020-02610-z.
Laakso, Mikael, Lisa Matthias, and Najko Jahn. “Open Is Not Forever: A Study of Vanished Open Access Journals.” Journal of the Association for Information Science and Technology 72, no. 9 (September 2021): 1099–1112.
Ortega, Jose Luis. Academic Search Engines: New Information Trends and Services for Scientists on the Web. Chandos information professional series. Philadelphia, PA: Elsevier, 2014.
Page, Roderic. “Notes on Bibliographic Metadata in JSON.” Last modified July 12, 2017. Accessed March 11, 2019. https://github.com/rdmpage/bibliographic-metadata-json.
Pettifer, S., P. McDermott, J. Marsh, D. Thorne, A. Villeger, and T.K. Attwood. “Ceci n’est Pas Un Hamburger: Modelling and Representing the Scholarly Article.” Learned Publishing 24, no. 3 (July 2011): 207–220.
Piwowar, Heather, Jason Priem, Vincent Larivière, Juan Pablo Alperin, Lisa Matthias, Bree Norlander, Ashley Farley, Jevin West, and Stefanie Haustein. “The State of OA: A Large-Scale Analysis of the Prevalence and Impact of Open Access Articles.” PeerJ 6 (February 13, 2018): e4375.
Ramalho, Luciano G. “From ISIS to CouchDB: Databases and Data Models for Bibliographic Records.” The Code4Lib Journal, no. 13 (April 11, 2011). Accessed March 11, 2019. https://journal.code4lib.org/articles/4893.
rclark1. “DOI-like Strings and Fake DOIs.” Website. Crossref. Accessed March 11, 2019. https://www.crossref.org/blog/doi-like-strings-and-fake-dois/.
Svenonius, Elaine. The Intellectual Foundation of Information Organization. First MIT Press paperback ed. Digital libraries and electronic publishing. Cambridge, Mass.: MIT Press, 2009.
Van de Sompel, Herbert, Robert Sanderson, Martin Klein, Michael L. Nelson, Bernhard Haslhofer, Simeon Warner, and Carl Lagoze. “A Perspective on Resource Synchronization.” D-Lib Magazine 18, no. 9/10 (September 2012). Accessed March 11, 2019. http://www.dlib.org/dlib/september12/vandesompel/09vandesompel.html.
Wright, Alex. Cataloging the World: Paul Otlet and the Birth of the Information Age. Oxford ; New York: Oxford University Press, 2014.
“Citation Style Language.” Citation Style Language. Accessed March 11, 2019. https://citationstyles.org/.
“Open Archives Initiative Protocol for Metadata Harvesting.” Accessed March 11, 2019. https://www.openarchives.org/pmh/.