Fatcat Guide
Data Model
Entity Types and Ontology
Loosely following "Functional Requirements for Bibliographic Records" (FRBR), but removing the "manifestation" abstraction, and favoring files (digital artifacts) over physical items, the primary bibliographic entity types are:

- `work`: represents an abstract unit of creative output. Does not contain any metadata itself; used only to group `release` entities. For example, a journal article could be posted as a pre-print, published on a journal website, translated into multiple languages, and then re-published (with minimal changes) as a book chapter; these would all be variants of the same `work`.
- `release`: a specific "release" or "publicly published" version of a work. Contains traditional bibliographic metadata (title, date of publication, media type, language, etc). Has relationships to other entities:
    - child of a single `work` (required)
    - multiple `creator` entities as "contributors" (authors, editors)
    - outbound references to multiple other `release` entities
    - member of a single `container`, for example a journal or book series
- `file`: a single concrete, fixed digital artifact; a manifestation of one or more `releases`. Machine-verifiable metadata includes file hashes, size, and detected file format. Verified URLs link to locations on the open web where this file can be found or has been archived. Has relationships:
    - multiple `release` entities that this file is a complete manifestation of (almost always a single release)
- `fileset`: a list of multiple concrete files, together forming a complete `release` manifestation. Primarily intended for datasets and supplementary materials; could also contain a paper "package" (source file and figures).
- `webcapture`: a single snapshot (point in time) of a webpage or small website (multiple pages) which is a complete manifestation of a `release`. Not a landing page or page referencing the release.
- `creator`: a persona (pseudonym, group, or specific human name) that has contributed to one or more `releases`. Not necessarily one-to-one with a human person.
- `container` (aka "venue", "serial", "title"): a grouping of releases from a single publisher.
Note that, compared to many similar bibliographic ontologies, the current one does not have entities to represent:
- physical artifacts, either generically or specific copies
- funding sources
- publishing entities
- "events at a time and place"
Each entity type has its own relations and fields (captured in a schema), but there are also generic operations and fields common across all entities.
Common Entity Fields
All entities have:

- `extra` (dict, optional): free-form JSON metadata

The "extra" field is an "escape hatch" to include extra fields not in the regular schema. It is intended to enable gradual evolution of the schema, as well as accommodating niche or field-specific content. Reasonable care should be taken with this extra metadata: don't include large text or binary fields, hundreds of fields, duplicate metadata, etc.
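One way a client might treat "extra" in practice is as a fallback namespace next to the regular schema fields. A minimal sketch, with invented entity values:

```python
# The nested keys under "extra" here are illustrative, not a fixed schema.
container = {
    "name": "Journal of Important Results",
    "issnl": "1234-5678",
    "extra": {
        "original_name": "Zeitschrift für Wichtige Ergebnisse",
        "country": "de",
    },
}

def get_field(entity, key):
    """Look up a regular schema field, falling back to 'extra' metadata."""
    if key in entity:
        return entity[key]
    return entity.get("extra", {}).get(key)

assert get_field(container, "name") == "Journal of Important Results"
assert get_field(container, "country") == "de"
```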
All full entities (distinct from revisions) also have the following fields:

- `state` (string, read-only): summarizes the status of the entity in the catalog. One of a small number of fixed values; see vocabulary below.
- `ident` (string, Fatcat identifier, read-only): the Fatcat entity identifier
- `revision` (string, UUID): the current revision record that this entity `ident` points to
- `redirect` (string, Fatcat identifier, optional): if set, this entity `ident` has been redirected to the `redirect` one. This is a mechanism for merging or "deduplicating" entities.
- `edit_extra` (dict, optional): not part of the bibliographic schema, but can be included when creating or updating entities; the contents of this field will be included in the entity's edit history.
`state` Vocabulary

- `active`: entity exists in the catalog
- `redirect`: the entity `ident` exists in the catalog, but is a redirect to another entity ident
- `deleted`: an entity with the `ident` did exist in the catalog previously, but it was deleted. The `ident` is retained as a "tombstone" record (aka, there is a record that an entity did exist previously).
- `wip` ("Work in Progress"): an entity identifier has been created as part of an editgroup, but that editgroup has not yet been accepted into the catalog, and there is no previous/current version of the entity.
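A client resolving an identifier has to branch on this vocabulary: follow `redirect` entries, stop at `active`, and treat `deleted` and `wip` as having no public revision. A hedged sketch, with invented catalog contents:

```python
def resolve(catalog, ident, max_hops=5):
    """Follow redirects until an active entity (or a dead end) is reached."""
    for _ in range(max_hops):
        entity = catalog[ident]
        state = entity["state"]
        if state == "active":
            return entity["revision"]
        if state == "redirect":
            ident = entity["redirect"]   # follow merge/dedupe redirect
            continue
        if state in ("deleted", "wip"):
            return None                  # tombstone or not-yet-accepted
    raise RuntimeError("too many redirect hops")

catalog = {
    "aaaa": {"state": "redirect", "redirect": "bbbb"},
    "bbbb": {"state": "active", "revision": "rev-123"},
    "cccc": {"state": "deleted"},
}
assert resolve(catalog, "aaaa") == "rev-123"
assert resolve(catalog, "cccc") is None
```

The hop limit guards against accidental redirect cycles, which a robust client should assume are possible.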
Identifiers and Revisions
A specific version of any entity in the catalog is called a "revision".
Revisions are generally immutable (do not change and are not editable), and are
not normally referred to directly. Instead, persistent "fatcat identifiers"
(`ident`) can be created, which "point to" a single revision at a time. This
distinction means that entities referred to by an identifier can change over
time (as metadata is corrected and expanded). Revision objects do not "point"
back to specific identifiers, so they are not the same as a simple "version
number" for an identifier.
Identifiers also have the ability to be merged (by redirecting one identifier to another) and "deleted" (by pointing the identifier to no revision at all). All changes to identifiers are captured as an "edit" object. Edit history can be fetched and inspected on a per-identifier basis.
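The ident/revision/edit split described above can be modeled as three small tables; this toy sketch is illustrative of the concept, not the server's actual storage:

```python
# All names and values here are invented for illustration.
revisions = {}   # revision_id -> metadata (immutable once written)
idents = {}      # ident -> current revision_id
edits = []       # append-only edit objects, one per identifier change

def apply_edit(ident, new_revision_id, metadata=None):
    if metadata is not None:
        revisions[new_revision_id] = metadata
    idents[ident] = new_revision_id
    edits.append({"ident": ident, "revision": new_revision_id})

apply_edit("rel-1", "rev-a", {"title": "Draft Title"})
apply_edit("rel-1", "rev-b", {"title": "Corrected Title"})

# The ident now points at the new revision; the old revision still exists:
assert idents["rel-1"] == "rev-b"
assert revisions["rev-a"]["title"] == "Draft Title"

# Per-identifier edit history can be reconstructed from the edit objects:
history = [e["revision"] for e in edits if e["ident"] == "rel-1"]
assert history == ["rev-a", "rev-b"]
```

Because revisions never point back at idents, the "history" only exists in the edit objects; this is why revisions are not simple version numbers.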
Controlled Vocabularies
Some individual fields have additional constraints, either in the form of pattern validation ("values must be upper case, contain only certain characters"), or membership in a fixed set of values. These may include:
- license and open access status
- work "types" (article vs. book chapter vs. proceeding, etc)
- contributor types (author, translator, illustrator, etc)
- human languages
- identifier namespaces (DOI, ISBN, ISSN, ORCID, etc; but not the identifiers themselves)
Other fixed-set "vocabularies" become too large to easily maintain or express in code. These could be added to the backend databases, or be enforced by bots (instead of the system itself). These mostly include externally-registered identifiers or types, such as:
- file mimetypes
- identifiers themselves (DOI, ORCID, etc), by checking for registration against canonical APIs and databases
Global Edit Changelog
As part of the process of "accepting" an edit group, a row is written to an immutable, append-only table (which internally is a SQL table) documenting each identifier change. This changelog establishes a monotonically increasing version number for the entire corpus, and should make interaction with other systems easier (eg, search engines, replicated databases, alternative storage backends, notification frameworks, etc.).
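The changelog contract (append-only, monotonically increasing index) is what makes downstream polling simple. A minimal sketch of that consumer pattern, with invented editgroup identifiers:

```python
changelog = []   # append-only; index is the corpus-wide version number

def accept_editgroup(editgroup_id):
    entry = {"index": len(changelog) + 1, "editgroup_id": editgroup_id}
    changelog.append(entry)
    return entry["index"]

assert accept_editgroup("eg-one") == 1
assert accept_editgroup("eg-two") == 2

# A search engine or replica only needs to remember the last index it saw:
last_seen = 1
new_entries = [e for e in changelog if e["index"] > last_seen]
assert [e["editgroup_id"] for e in new_entries] == ["eg-two"]
```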
Container Entity Reference
Fields
- `name` (string, required): The title of the publication, as used in international indexing services. Eg, "Journal of Important Results". Not necessarily in the native language, but also not necessarily in English. Alternative titles (and translations) can be stored in "extra" metadata (see below)
- `container_type` (string): eg, journal vs. conference vs. book series. Controlled vocabulary is described below.
- `publication_status` (string): whether actively publishing, never published anything, or discontinued. Controlled vocabulary is described below.
- `publisher` (string): The name of the publishing organization. Eg, "Society of Curious Students".
- `issnl` (string): an external identifier, with registration controlled by the ISSN organization. Registration is relatively inexpensive and easy to obtain (depending on world region), so almost all serial publications have one. The ISSN-L ("linking ISSN") is one of either the print (`issnp`) or electronic (`issne`) identifiers for a serial publication; not all publications have both types of ISSN, but many do, which can cause confusion. The ISSN master list is not gratis/public, but the ISSN-L mapping is.
- `issne` (string): Electronic ISSN ("ISSN-E")
- `issnp` (string): Print ISSN ("ISSN-P")
- `wikidata_qid` (string): external linking identifier to a Wikidata entity.
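The three ISSN fields all share ISSN syntax: `NNNN-NNNC`, where the final character is a mod-11 check digit ('X' meaning 10). A hedged validation helper; this mirrors the published ISSN standard, not any specific fatcat server-side check:

```python
import re

def valid_issn(issn: str) -> bool:
    """Check ISSN syntax and mod-11 check digit (ISO 3297)."""
    if not re.fullmatch(r"\d{4}-\d{3}[0-9X]", issn):
        return False
    digits = issn.replace("-", "")
    # weights 8..2 over the first seven digits:
    total = sum(int(d) * w for d, w in zip(digits[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    return digits[7] == ("X" if check == 10 else str(check))

assert valid_issn("1932-6203")      # PLOS ONE
assert not valid_issn("1932-6204")  # bad check digit
assert not valid_issn("19326203")   # missing hyphen
```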
`extra` Fields
- `abbrev` (string): a commonly used abbreviation for the publication, as used in citations, following the [ISO 4][] standard. Eg, "Journal of Polymer Science Part A" -> "J. Polym. Sci. A"
- `acronym` (string): acronym of the publication name. Usually all upper-case, but sometimes a very terse, single-word truncated form of the name (eg, a pun).
- `coden` (string): an external identifier, the [CODEN code][]. 6 characters, all upper-case.
- `default_license` (string, slug): short name (eg, "CC-BY-SA") for the default/recommended license for works published in this container
- `original_name` (string): native name (if `name` is translated)
- `platform` (string): hosting platform: OJS, wordpress, scielo, etc
- `mimetypes` (array of strings): formats that this container publishes all works under (eg, 'application/pdf', 'text/html')
- `first_year` (integer): first year of publication
- `last_year` (integer): final year of publication (implies that the container is no longer active)
- `languages` (array of strings): ISO codes; the first entry is considered the "primary" language (if that makes sense)
- `country` (string): ISO abbreviation (two characters) for the country this container is published in
- `aliases` (array of strings): significant alternative names or abbreviations for this container (not just capitalization/punctuation)
- `region` (string, slug): continent/world-region (vocabulary is TODO)
- `discipline` (string, slug): highest-level subject area (vocabulary is TODO)
- `urls` (array of strings): known homepage URLs for this container (first in array is default)
- `issnp` (string, deprecated): Print ISSN; deprecated now that there is a top-level field
- `issne` (string, deprecated): Electronic ISSN; deprecated now that there is a top-level field
Additional fields used in analytics and "curation" tracking:

- `doaj` (object)
    - `as_of` (string, ISO datetime): datetime of most recent check; if not set, not actually in DOAJ
    - `seal` (bool): has DOAJ seal
    - `work_level` (bool): whether work-level publications are registered with DOAJ
    - `archive` (array of strings): preservation archives
- `road` (object)
    - `as_of` (string, ISO datetime): datetime of most recent check; if not set, not actually in ROAD
- `kbart` (object)
    - `lockss`, `clockss`, `portico`, `jstor`, etc (object)
        - `year_spans` (array of arrays of integers (pairs)): year spans (inclusive) for which the given archive has preserved this container
        - `volume_spans` (array of arrays of integers (pairs)): volume spans (inclusive) for which the given archive has preserved this container
- `sherpa_romeo` (object):
    - `color` (string): the SHERPA/RoMEO "color" of the publisher of this container
- `doi`: TODO: include list of prefixes and which (if any) DOI registrar is used
- `dblp` (object):
    - `prefix` (string): prefix of dblp keys published as part of this container (eg, 'journals/blah' or 'conf/xyz')
- `ia` (object): Internet Archive specific fields
    - `sim` (object): same format as `kbart` preservation above; coverage in microfilm collection
    - `longtail` (bool): is this considered a "long-tail" open access venue
- `publisher_type` (string): controlled vocabulary
For KBART and other "coverage" fields, we "over-count" on the assumption that works with "in-progress" status will soon actually be preserved. Elements of these arrays are either an integer (means that single year is preserved), or an array of length two (meaning everything between the two numbers (inclusive) is preserved).
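A helper for interpreting these coverage arrays might look like the sketch below (the `kbart_lockss` data is invented); each element is either a bare year or an inclusive `[start, end]` pair:

```python
def expand_spans(year_spans):
    """Expand mixed single-year / [start, end] entries into a set of years."""
    years = set()
    for span in year_spans:
        if isinstance(span, int):
            years.add(span)       # a single preserved year
        else:
            start, end = span
            years.update(range(start, end + 1))  # inclusive on both ends
    return years

kbart_lockss = {"year_spans": [1998, [2001, 2004], 2010]}
covered = expand_spans(kbart_lockss["year_spans"])
assert covered == {1998, 2001, 2002, 2003, 2004, 2010}
assert 2000 not in covered
```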
`container_type` Vocabulary

- `journal`
- `proceedings`
- `conference-series`
- `book-series`
- `blog`
- `magazine`
- `trade`
- `test`
`publication_status` Vocabulary

- `active`: ongoing publication of new releases
- `suspended`: publication has stopped, but may continue in the future
- `discontinued`: publication has permanently ceased
- `vanished`: publication has stopped, and public traces have vanished (eg, the publisher website has disappeared with no notice)
- `never`: no works were ever published under this container
- `one-time`: releases were all published as a one-time event; for example, a single instance of a conference, or a fixed-size book series
File Entity Reference
Fields
- `size` (integer, positive, non-zero): Size of file in bytes. Eg: 1048576.
- `md5` (string): MD5 hash in lower-case hex. Eg: "d41efcc592d1e40ac13905377399eb9b".
- `sha1` (string): SHA-1 hash in lower-case hex. Not technically required, but the most-used of the hash fields and should always be included. Eg: "f013d66c7f6817d08b7eb2a93e6d0440c1f3e7f8".
- `sha256` (string): SHA-256 hash in lower-case hex. Eg: "a77e4c11a57f1d757fca5754a8f83b5d4ece49a2d28596889127c1a2f3f28832".
- `urls`: An array of "typed" URLs. Order is not meaningful, and may not be preserved.
    - `url` (string, required): Eg: "https://example.edu/~frau/prcding.pdf".
    - `rel` (string, required): Eg: "webarchive"; see vocabulary below.
- `mimetype` (string): Format of the file. If XML, a specific schema can be included after a `+`. Example: "application/pdf"
- `content_scope` (string): for situations where the file does not simply contain the full representation of a work (eg, fulltext of an article, for an `article-journal` release), describes what that scope of coverage is. Eg, entire `issue`, `corrupt` file. See vocabulary below.
- `release_ids` (array of string identifiers): references to `release` entities that this file represents a manifestation of. Note that a single file can contain multiple release references (eg, a PDF containing a full issue with many articles), and that a release will often have multiple files (differing only by watermarks, or different digitizations of the same printed work, or variant MIME/media types of the same published work).
- `extra` (object with string keys): additional metadata about this file
    - `path`: filename, with optional path prefix. The path must be "relative", not "absolute", and should use UNIX-style forward slashes, not Windows-style backward slashes
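The machine-verifiable fields (`size`, `md5`, `sha1`, `sha256`) can all be derived from the raw bytes with the standard library; a minimal sketch:

```python
import hashlib

def file_fields(data: bytes) -> dict:
    """Compute the verifiable File entity fields for a blob of bytes."""
    return {
        "size": len(data),
        "md5": hashlib.md5(data).hexdigest(),      # lower-case hex
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }

fields = file_fields(b"hello world\n")
assert fields["size"] == 12
assert fields["sha1"] == "22596363b3de40b06f981fb85d82312e8c0ed511"
```

For real files, the bytes would be read in chunks and fed to `hashlib` incrementally rather than loaded whole.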
URL `rel` Vocabulary

- `web`: generic public web sites; for `http`/`https` URLs, this should be the default
- `webarchive`: full URL to a resource in a long-term web archive
- `repository`: direct URL to a resource stored in a repository (eg, an institutional or field-specific research data repository)
- `academicsocial`: academic social networks (such as academia.edu or ResearchGate)
- `publisher`: resources hosted on the publisher's website
- `aggregator`: fulltext aggregator or search engine, like CORE or Semantic Scholar
- `dweb`: content hosted on distributed/decentralized web protocols, such as `dat://` or `ipfs://` URLs
`content_scope` Vocabulary

This same vocabulary is shared between file, fileset, and webcapture entities; not all the fields make sense for each entity type.

- if not set, assume that the artifact entity is valid and represents a complete copy of the release
- `issue`: artifact contains an entire issue of a serial publication (eg, issue of a journal), representing several releases in full
- `abstract`: contains only an abstract (short description) of the release, not the release itself (unless the `release_type` itself is `abstract`, in which case it is the entire release)
- `index`: index of a journal, or series of abstracts from a conference
- `slides`: slide deck (usually in "landscape" orientation)
- `front-matter`: non-article content from a journal, such as editorial policies
- `supplement`: usually a file entity which is a supplement or appendix, not the entire work
- `component`: a sub-component of a release, which may or may not be associated with a `component` release entity. For example, a single figure or table as part of an article
- `poster`: digital copy of a poster, eg as displayed at conference poster sessions
- `sample`: a partial sample of the entire work. Eg, just the first page of an article. Distinct from `truncated`
- `truncated`: the file has been truncated at a binary level, and may also be corrupt or invalid. Distinct from `sample`
- `corrupt`: broken, mangled, or corrupt file (at the binary level)
- `stub`: any other out-of-scope artifact situation, where the artifact represents something which would not link to any possible in-scope release in the catalog (except a `stub` release)
- `landing-page`: for webcapture, the landing page of a work, as opposed to the work itself
- `spam`: content is spam. Articles, webpages, or issues which include incidental advertisements within them are not counted as `spam`
Creator Entity Reference
Fields

- `display_name` (string, required): Full name, as will be displayed in user interfaces. Eg, "Grace Hopper"
- `given_name` (string): Also known as "first name". Eg, "Grace".
- `surname` (string): Also known as "last name". Eg, "Hopper".
- `orcid` (string): external identifier, as registered with ORCID.
- `wikidata_qid` (string): external linking identifier to a Wikidata entity.

`extra` Fields

All are optional.

- `also-known-as` (list of objects): additional names that this creator may be known under. For example, previous names, aliases, or names in different scripts. Can include any or all of `display_name`, `given_name`, or `surname` as keys.
Human Names
Representing names of human beings in databases is a fraught subject. For some background reading, see:
- Falsehoods Programmers Believe About Names (blog post)
- Personal names around the world (W3C informational)
- Hubert Blaine Wolfeschlegelsteinhausenbergerdorff Sr. (Wikipedia article)
Particularly difficult issues in the context of a bibliographic database include:
- the non-universal concept of "family" vs. "given" names and their relationship to first and last names
- the inclusion of honorary titles and other suffixes and prefixes to a name
- the distinction between "preferred", "legal", and "bibliographic" names, or other situations where a person may not wish to be known under the name they are commonly referred to by
- language and character set issues
- different conventions for sorting and indexing names
- the sprawling world of citation styles
- name changes
- pseudonyms, anonymous publications, and fake personas (perhaps representing a group, like Bourbaki)
The general guidance for Fatcat is to:
- not be a "source of truth" for representing a persona or human being; ORCID and Wikidata are better suited to this task
- represent author personas, not necessarily 1-to-1 with human beings
- balance the concerns of readers with those of the author
- enable basic interoperability with external databases, file formats, schemas, and style guides
- when possible, respect the wishes of individual authors
The data model for the `creator` entity has three name fields:

- `surname` and `given_name`: needed for "aligning" with external databases, and to export metadata to many standard formats
- `display_name`: the "preferred" representation for display of the entire name, in the context of international attribution of authorship of a written work

Names do not necessarily need to be expressed in a Latin character set, nor do they necessarily need to be in the native language of the creator or the language of their notable works.
Ideally all three fields are populated for all creators.
It seems likely that this schema and guidance will need review.
Fileset Entity Reference
Fields

- `manifest` (array of objects): each entry represents a file
    - `path` (string, required): relative path to file (including filename)
    - `size` (integer, required): in bytes
    - `md5` (string): MD5 hash in lower-case hex
    - `sha1` (string): SHA-1 hash in lower-case hex
    - `sha256` (string): SHA-256 hash in lower-case hex
    - `mimetype` (string): Content type in MIME type schema
    - `extra` (object): any extra metadata about this specific file. All are optional.
        - `original_url`: live web canonical URL to download this file
        - `webarchive_url`: web archive capture of this file
- `urls`: An array of "typed" URLs. Order is not meaningful, and may not be preserved. These are URLs for the entire fileset, not individual files.
    - `url` (string, required): Eg: "https://example.edu/~frau/prcding.pdf".
    - `rel` (string, required): Eg: "archive-base", "webarchive".
- `release_ids` (array of string identifiers): references to `release` entities
- `content_scope` (string): for situations where the fileset does not simply contain the full representation of a work (eg, all files in a dataset, for a `dataset` release), describes what that scope of coverage is. Uses the same vocabulary as the File entity.
- `extra` (object with string keys): additional metadata about this group of files, including upstream platform-specific metadata and identifiers
    - `platform_id`: platform-specific identifier for this fileset
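A hedged sketch of validating the required manifest-entry constraints described above (relative paths with forward slashes, integer sizes); this is an illustrative client-side check, not the server's actual validation:

```python
def check_manifest_entry(entry: dict):
    """Return a list of problems with a fileset manifest entry."""
    problems = []
    path = entry.get("path", "")
    if not path:
        problems.append("path is required")
    elif path.startswith("/") or "\\" in path:
        problems.append("path must be relative, with forward slashes")
    if not isinstance(entry.get("size"), int):
        problems.append("size (integer) is required")
    return problems

good = {"path": "data/run1/results.csv", "size": 4096}
bad = {"path": "\\data\\results.csv"}   # Windows-style, and no size
assert check_manifest_entry(good) == []
assert len(check_manifest_entry(bad)) == 2
```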
URL `rel` Types

Any `rel` ending in "-base" implies that a file path (from the manifest) can be appended to the "base" URL to get a file download URL. Any "bundle" `rel` implies a direct link to an archive or "bundle" (like `.zip` or `.tar`) which contains all the files in this fileset.

- `repository` or `platform` or `web`: URL of a live-web landing page or other location where content can be found. May or may not be machine-reachable.
- `webarchive`: web archive version of a `repository` landing page
- `repository-bundle`: direct URL to a live-web "archive" file, such as `.zip`, which contains all of the individual files in this fileset
- `webarchive-bundle`: web archive version of `repository-bundle`
- `archive-bundle`: file archive version of `repository-bundle`
- `repository-base`: live-web base URL/directory to which the file `path` can be appended to fetch individual files
- `archive-base`: base URL/directory to which the file `path` can be appended to fetch individual files
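The "-base" convention can be sketched as simple string joining; all URLs here are invented examples:

```python
def file_url(base_url: str, path: str) -> str:
    """Append a manifest path to a '-base' URL to get a download URL."""
    return base_url.rstrip("/") + "/" + path

fileset_urls = [
    {"rel": "archive-base", "url": "https://archive.example.org/fs123/"},
    {"rel": "repository-bundle", "url": "https://repo.example.org/fs123.zip"},
]

# pick any "-base" URL, then append the per-file manifest path:
base = next(u["url"] for u in fileset_urls if u["rel"].endswith("-base"))
assert file_url(base, "data/results.csv") == \
    "https://archive.example.org/fs123/data/results.csv"
```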
Web Capture Entity Reference
Fields

Warning: This schema is not yet stable.

- `cdx` (array of objects): each entry represents a distinct web resource (URL). The first is considered the primary/entry. Roughly aligns with the CDXJ schema.
    - `surt` (string, required): sortable URL format
    - `timestamp` (string, datetime, required): ISO format, UTC timezone, with the `Z` suffix required, with second (or finer) precision. Eg, "2016-09-19T17:20:24Z". Wayback timestamps (like "20160919172024") should be converted naively.
    - `url` (string, required): full URL
    - `mimetype` (string): content type of the resource
    - `status_code` (integer, signed): HTTP status code
    - `sha1` (string, required): SHA-1 hash in lower-case hex
    - `sha256` (string): SHA-256 hash in lower-case hex
- `archive_urls`: An array of "typed" URLs where this snapshot can be found. Can be wayback/memento instances, or direct links to a WARC file containing all the capture resources. Often will only be a single archive. Order is not meaningful, and may not be preserved.
    - `url` (string, required): Eg: "https://example.edu/~frau/prcding.pdf".
    - `rel` (string, required): Eg: "wayback" or "warc"
- `original_url` (string): base URL of the resource. May reference a specific CDX entry, or may be in normalized form.
- `timestamp` (string, datetime): same format as the CDX line timestamp (UTC, etc). Corresponds to the overall capture timestamp. Can be the earliest of the CDX timestamps if that makes sense.
- `content_scope` (string): for situations where the webcapture does not simply contain the full representation of a work (eg, HTML fulltext, for an `article-journal` release), describes what that scope of coverage is. Eg, `landing-page` if it doesn't contain the full content. Landing pages are out-of-scope for fatcat, but if they were accidentally imported, they should be marked as such so they aren't re-imported. Uses the same vocabulary as the File entity.
- `release_ids` (array of string identifiers): references to `release` entities
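The "naive" Wayback-to-ISO timestamp conversion mentioned above (both timestamps treated as UTC, no timezone math) can be done with the standard library:

```python
from datetime import datetime, timezone

def wayback_to_iso(ts: str) -> str:
    """Convert a 14-digit Wayback timestamp to the ISO/UTC 'Z' format."""
    dt = datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")

assert wayback_to_iso("20160919172024") == "2016-09-19T17:20:24Z"
```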
Release Entity Reference
Fields

- `title` (string, required): the display title of the release. May include a subtitle.
- `subtitle` (string): intended to be used primarily with books, not journal articles. The subtitle may also be appended to the `title` instead of populating this field.
- `original_title` (string): the full original language title, if `title` is translated
- `work_id` (fatcat identifier; required): the (single) work that this release is grouped under. If not specified in a creation (`POST`) action, the API will auto-generate a work.
- `container_id` (fatcat identifier): a (single) container that this release is part of. When expanded, the `container` field contains the full `container` entity.
- `release_type` (string, controlled set): represents the medium or form-factor of this release; eg, "book" versus "journal article". Not necessarily the same across all releases of a work. See definitions below.
- `release_stage` (string, controlled set): represents the publishing/review lifecycle status of this particular release of the work. See definitions below.
- `release_date` (string, ISO date format): when this release was first made publicly available. Blank if only the year is known.
- `release_year` (integer): year when this release was first made publicly available; should match `release_date` if both are known.
- `withdrawn_status` (optional, string, controlled set)
- `withdrawn_date` (optional, string, ISO date format): when this release was withdrawn. Blank if only the year is known.
- `withdrawn_year` (optional, integer): year when this release was withdrawn; should match `withdrawn_date` if both are known.
- `ext_ids` (key/value object of string-to-string mappings): external identifiers. At least an empty `ext_ids` object is always required for release entities, so individual identifiers can be accessed directly.
- `volume` (string): optionally, stores the specific volume of a serial publication this release was published in.
- `issue` (string): optionally, stores the specific issue of a serial publication this release was published in.
- `pages` (string): the pages (within a volume/issue of a publication) that this release can be looked up under. This is a free-form string, and could represent the first page, a range of pages, or even prefix pages (like "xii-xxx").
- `version` (string): optionally, distinguishes this release version from others. Generally a number, software-style version, or other short/slug string, not a freeform description. Book "edition" descriptions can also go in an `edition` extra field. Often used in conjunction with external identifiers. If you're not certain, don't use this field!
- `number` (string): an inherent identifier for this release (or work), often part of the title. For example, standards numbers, technical memo numbers, book series numbers, etc. Not a book `chapter` number however (which can be stored in `extra`). Depending on field- or series-specific norms, the number may be stored here, in the title, or in both fields.
- `publisher` (string): name of the publishing entity. This does not need to be populated if the associated `container` entity has the publisher field set, though it is acceptable to duplicate, as the publishing entity of a container may differ over time. Should be set for singleton releases, like books.
- `language` (string, slug): the primary language used in this particular release of the work. Only a single language can be specified; additional languages can be stored in "extra" metadata (TODO: which field?). This field should be a valid RFC1766/ISO639 language code (two letters). AKA, a controlled vocabulary, not a free-form name of the language.
- `license_slug` (string, slug): the license of this release. Usually a Creative Commons short code (eg, `CC-BY`), though a small number of other short names for publisher-specific licenses are included (TODO: list these).
- `contribs` (array of objects): an array of authorship and other `creator` contributions to this release. Contribution fields include:
    - `index` (integer, optional): the (zero-indexed) order of this author. Authorship order has significance in many fields. Non-author contributions (illustration, translation, editorship) may or may not be ordered, depending on context, but index numbers should be unique per release (aka, there should not be both a "first author" and a "first translator")
    - `creator_id` (identifier): if known, a reference to a specific `creator`
    - `raw_name` (string): the name of the contributor, as attributed in the text of this work. If the `creator_id` is linked, this may be different from the `display_name`; if a creator is not linked, this field is particularly important. Syntax and name order is not specified, but most often will be "display order", not index/alphabetical (in Western tradition, surname followed by given name).
    - `role` (string, of a set): the type of contribution, from a controlled vocabulary. TODO: vocabulary needs review.
    - `extra` (string): additional context can go here. For example, author affiliation, "this is the corresponding author", etc.
- `refs` (array of ident strings): references (aka, citations) to other releases. References can only be linked to a specific target release (not a work), though it may be ambiguous which release of a work is being referenced if the citation is not specific enough. IMPORTANT: release refs are distinct from the reference graph API. Reference fields include:
    - `index` (integer, optional): reference lists and bibliographies almost always have an implicit order. Zero-indexed. Note that this is distinct from the `key` field.
    - `target_release_id` (fatcat identifier): if known, and the release exists, a cross-reference to the Fatcat entity
    - `extra` (JSON, optional): additional citation format metadata can be stored here, particularly if the citation schema does not align. Common fields might be "volume", "authors", "issue", "publisher", "url", and external identifiers ("doi", "isbn13").
    - `key` (string): works often reference works with a short slug or index number, which can be captured here. For example, "[BROWN2017]". Keys generally supersede the `index` field, though both can/should be supplied.
    - `year` (integer): year of publication of the cited release.
    - `container_title` (string): if applicable, the name of the container of the release being cited, as written in the citation (usually an abbreviation).
    - `title` (string): the title of the work/release being cited, as written.
    - `locator` (string): a more specific reference into the work/release being cited, for example the page number(s). For web references, store the URL in "extra", not here.
- `abstracts` (array of objects): see below
    - `sha1` (string, hex, required): reference to the abstract content (string). Example: "3f242a192acc258bdfdb151943419437f440c313"
    - `content` (string): The abstract raw content itself. Example: `<jats:p>Some abstract thing goes here</jats:p>`
    - `mimetype` (string): not formally required, but should effectively always get set. `text/plain` if the abstract doesn't have a structured format
    - `lang` (string, controlled set): the human language this abstract is in. See the `language` field of release for format and vocabulary.
External Identifiers (`ext_ids`)

The `ext_ids` object name-spaces external identifiers and makes it easier to add new identifiers to the schema in the future.
Many identifier fields must match an internal regex (string syntax constraint) to ensure they are properly formatted, though these checks aren't always complete or correct in more obscure cases.
- `doi` (string): full DOI number, lower-case. Example: "10.1234/abcde.789". See the section below for more about DOIs specifically.
- `wikidata_qid` (string): external identifier for Wikidata entities. These are integers prefixed with "Q", like "Q4321". Each `release` entity can be associated with at most one Wikidata entity (this field is not an array), and Wikidata entities should be associated with at most a single `release`. In the future it may be possible to associate Wikidata entities with `work` entities instead.
- `isbn13` (string): external identifier for books. ISBN-10 and other formats should be converted to canonical ISBN-13.
- `pmid` (string): external identifier for the PubMed database. These are bare integers, but stored in a string format.
- `pmcid` (string): external identifier for the PubMed Central database. These are integers prefixed with "PMC" (upper case), like "PMC4321". Versioned PMCIDs can also be stored (eg, "PMC4321.1"); future clarification of whether versions should always be stored will be needed.
- `core` (string): external identifier for the CORE open access aggregator. Not used much in practice. These identifiers are integers, but stored in string format.
- `arxiv` (string): external identifier to a (version-specific) arxiv.org work. For releases, must always include the `vN` suffix (eg, `v3`).
- `jstor` (string): external identifier for works in JSTOR which do not have a valid registered DOI.
- `ark` (string): ARK identifier.
- `mag` (DEPRECATED; string): Microsoft Academic Graph (MAG) identifier. As of December 2021, no entities in the catalog have a value for this field.
- `doaj` (string): DOAJ article-level identifier
- `dblp` (string): dblp article-level identifier
- `oai` (string): OAI-PMH record id. Only use if no other identifier is available.
- `hdl` (string): handle.net identifier. While DOIs are technically handles, do not put DOIs in this field. Handles are normalized to lower-case in the catalog (server-side).
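Illustrative syntax checks for a few of these identifier namespaces; these patterns approximate the constraints described above and are not the server's actual validation regexes:

```python
import re

PATTERNS = {
    "doi": re.compile(r"10\.\d{3,}/\S+"),    # "10." prefix, then suffix
    "pmcid": re.compile(r"PMC\d+(\.\d+)?"),  # optional ".N" version suffix
    "arxiv": re.compile(r"\S+v\d+"),         # version-specific, "vN" suffix
}

def check_ext_id(name: str, value: str) -> bool:
    if name == "doi" and value != value.lower():
        return False                         # DOIs are stored lower-case
    return PATTERNS[name].fullmatch(value) is not None

assert check_ext_id("doi", "10.1234/abcde.789")
assert not check_ext_id("doi", "10.1234/ABCDE.789")  # must be lower-case
assert check_ext_id("pmcid", "PMC4321.1")
assert check_ext_id("arxiv", "1234.5678v3")
```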
`extra` Fields
- `crossref` (object), for extra Crossref-specific metadata:
  - `subject` (array of strings) for subject/category of content
  - `type` (string): raw/original Crossref type
  - `alternative-id` (array of strings)
  - `archive` (array of strings), indicating preservation services deposited
  - `funder` (object/dictionary)
- `aliases` (array of strings) for additional titles this release might be known by
- `container_name` (string) if not matched to a container entity
- `group-title` (string) for releases within a collection/group
- `translation_of` (release identifier) if this release is a translation of another (usually under the same work)
- `superceded` (boolean) if there is another release under the same work that should be referenced/indicated instead. Intended as a temporary hint until proper work-based search is implemented. As an example use, all arxiv release versions except for the most recent get this set.
- `is_work_alias` (boolean): if true, this release is an alias or pointer to the entire work, or the most recent version of the work. For example, some data repositories have separate DOIs for each version of a dataset, plus an additional DOI that points to the "latest" version.
`release_type` Vocabulary
This vocabulary is based on the CSL types, with a small number of (proposed) extensions:
- `article-magazine`
- `article-journal`, including pre-prints and working papers
- `book`
- `chapter` is allowed, as chapters are frequently referenced and read independently of the entire book. The data model does not currently support linking a subset of a release to an entity representing the entire release. The release/work/file distinctions should not be used to group multiple chapters under a single work; a book chapter can be its own work. A paper which is republished as a chapter (eg, in a collection, or "edited" book) can have both releases under one work. The criterion for whether to "split" a book and have release entities for each chapter is whether the chapters have been cited/referenced as such.
- `dataset`
- `entry`, which can be used for generic web resources like question/answer site entries
- `entry-encyclopedia`
- `manuscript`
- `paper-conference`
- `patent`
- `post-weblog` for blog entries
- `report`
- `review`, for things like book reviews, not the "literature review" form of `article-journal`, nor peer reviews (see `peer_review`). Note `review-book` for book reviews specifically.
- `speech` can be used for, eg, slides and recorded conference presentations themselves, as distinct from `paper-conference`
- `thesis`
- `webpage`
- `peer_review` (fatcat extension)
- `software` (fatcat extension)
- `standard` (fatcat extension), for technical standards like RFCs
- `abstract` (fatcat extension), for releases that are only an abstract of a larger work. In particular, translations. Many are granted DOIs.
- `editorial` (custom extension) for columns, "in this issue", and other content published alongside peer-reviewed content in journals. Many are granted DOIs.
- `letter` for "letters to the editor", "authors respond", and other sub-article-length published content. Many are granted DOIs.
- `stub` (fatcat extension) for releases which have notable external identifiers, and thus are included "for completeness", but don't seem to represent a "full work".
- `component` (fatcat extension) for sub-components of a full paper or other work. Eg, tables, or individual files as part of a dataset.
An example of a `stub` might be a paper that gets an extra DOI by accident; the primary DOI should be a full release, and the accidental DOI can be a `stub` release under the same work. `stub` releases shouldn't be considered full releases when counting or aggregating (though if technically difficult this may not always be implemented). Other things that can be categorized as stubs (and which often end up mis-categorized as full articles in bibliographic databases):
- commercial advertisements
- "trap" or "honey pot" works, which are fakes included in databases to detect re-publishing without attribution
- "This page is intentionally blank"
- "About the author", "About the editors", "About the cover"
- "Acknowledgments"
- "Notices"
All other CSL types are also allowed, though they are mostly out of scope:
- `article` (generic; should usually be some other type)
- `article-newspaper`
- `bill`
- `broadcast`
- `entry-dictionary`
- `figure`
- `graphic`
- `interview`
- `legislation`
- `legal_case`
- `map`
- `motion_picture`
- `musical_score`
- `pamphlet`
- `personal_communication`
- `post`
- `review-book`
- `song`
- `treaty`
For the purpose of statistics, the following release types are considered "papers":
- `article`
- `article-journal`
- `chapter`
- `paper-conference`
- `thesis`
`release_stage` Vocabulary
These roughly follow the DRIVER publication version guidelines, with the addition of a `retraction` status.

- `draft` is an early version of a work which is not considered for peer review. Sometimes these are posted to websites or repositories for early comments and feedback.
- `submitted` is the version that was submitted for publication. Also known as "pre-print", "pre-review", "under review". Note that this doesn't imply that the work was ever actually submitted, reviewed, or accepted for publication, just that this is the version that "would be". Most versions in pre-print repositories are likely to have this status.
- `accepted` is a version that has undergone peer review and been accepted for publication, but has not gone through any publisher copy editing or re-formatting. Also known as "post-print", "author's manuscript", "publisher's proof".
- `published` is the version that the publisher distributes. May include minor (grammatical, typographical, broken link, aesthetic) corrections. Also known as "version of record", "final publication version", "archival copy".
- `updated`: post-publication significant updates (considered a separate release in Fatcat). Also known as "correction" (in the context of either a published "correction notice", or the full new version).
- `retraction` for post-publication retraction notices (should be a release under the same work as the `published` release).
Note that in the case of a retraction, the original publication does not get the `retracted` state; only the retraction notice does. The original publication does get a `withdrawn_status` metadata field set.
When blank, indicates status isn't known, and wasn't inferred at creation time. Can often be interpreted as `published`, but be careful!
`withdrawn_status` Vocabulary
We don't know of an existing controlled vocabulary for things like retractions or other reasons for marking papers as removed from publication, so we invented our own. These labels should be considered experimental and subject to change.
Note that some of these will apply more to pre-print servers or publishing accidents, and don't necessarily make sense as a formal change of status for a print journal publication.
Any value at all indicates that the release should be considered "no longer published by the publisher or primary host", which could mean different things in different contexts. As some concrete examples, works are often accidentally generated a duplicate DOI; physics papers have been taken down in response to government order under national security justifications; papers have been withdrawn for public health reasons (above and beyond any academic-style retraction); entire journals may be found to be predatory and pulled from circulation; individual papers may be retracted by authors if a serious mistake or error is found; an author's entire publication history may be retracted in cases of serious academic misconduct or fraud.
- `withdrawn` is generic: the work is no longer available from the original publisher. There may be no reason, or the reason may not be known yet.
- `retracted` for when a work is formally retracted, usually accompanied by a retraction notice (a separate release under the same work). Note that the retraction notice itself should not have a `withdrawn_status`.
- `concern` for when publishers release an "expression of concern", often indicating that the work is not reliable in some way, but not yet formally retracted. In this case the original work is probably still available, but should be marked as suspect. This is not the same as the presence of errata.
- `safety` for works pulled for public health or human safety concerns.
- `national-security` for works pulled over national security concerns.
- `spam` for content that is considered spam (eg, bogus pre-print or repository submissions). Not to be confused with advertisements or product reviews in journals.
`contribs.role` Vocabulary

- `author`
- `translator`
- `illustrator`
- `editor`
All other CSL role types are also allowed, though they are mostly out of scope for Fatcat:

- `collection-editor`
- `composer`
- `container-author`
- `director`
- `editorial-director`
- `editortranslator`
- `interviewer`
- `original-author`
- `recipient`
- `reviewed-author`
If blank, indicates that the type of contribution is not known; this can often be interpreted as authorship.
More About DOIs
All DOIs stored in an entity column should be registered (aka, should be
resolvable from doi.org
). Invalid identifiers may be cleaned up or removed by
bots.
DOIs should always be stored and transferred in lower-case form. Note that there are almost no other constraints on DOIs (or handles in general): they may contain multiple forward slashes or whitespace, and be of arbitrary length. Crossref has a number of examples of such "valid" but frustratingly formatted strings.
In the Fatcat ontology, DOIs and release entities are one-to-one.
It is the intention to automatically (via bot) create a Fatcat release for every Crossref-registered DOI from an allowlist of media types ("journal-article" etc, but not all), and it would be desirable to auto-create entities for in-scope publications from all registrars. It is not the intention to auto-create a release for every registered DOI. In particular, "sub-component" DOIs (eg, for an individual figure or table from a publication) aren't currently auto-created, but could be stored in "extra" metadata, or on a case-by-case basis.
Work Entity Reference
Works have no fields! They just group releases.
REST API
The Fatcat HTTP API is a read-only API for querying and searching the catalog.
A declarative specification of all API endpoints, JSON data models, and response types is available in OpenAPI 3.1 format. Auto-generated reference documentation is, for now, available at https://scholar.archive.org/docs.
All API traffic is over HTTPS. All endpoints accept and return only JSON serialized content.
Bulk Exports
There are several types of bulk exports and database dumps folks might be interested in:
- complete database dumps
- changelog history with all entity revisions and edit metadata
- identifier snapshot tables
- entity exports
All exports and dumps get uploaded to the Internet Archive under the "Fatcat Database Snapshots and Bulk Metadata Exports" collection.
Complete Database Dumps
The most simple and complete bulk export. Useful for disaster recovery, mirroring, or forking the entire service. The internal database schema is not stable, so not as useful for longitudinal analysis. These dumps will include edits-in-progress, deleted entities, old revisions, etc, which are potentially difficult or impossible to fetch through the API.
Public copies may have some tables redacted (eg, API credentials).
Dumps are in PostgreSQL `pg_dump` "tar" binary format, and can be restored locally with the `pg_restore` command. See `./extra/sql_dumps/` for commands and details. Dumps are on the order of 100 GBytes (compressed) and will grow over time.
Changelog History
These are currently unimplemented; would involve "hydrating" sub-entities into changelog exports. Useful for some mirrors, and analysis that needs to track provenance information. Format would be the public API schema (JSON).
All information in these dumps should be possible to fetch via the public API, including on a feed/streaming basis using the sequential changelog index. All information is also contained in the database dumps.
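As a sketch of the feed/streaming approach mentioned above, a client might poll the changelog endpoint and filter on the sequential index (the hostname and endpoint path here are assumptions based on historical deployments, not a documented contract):

```python
import json
import urllib.request

# Hostname and endpoint layout are assumptions; check the current
# reference documentation.
API_BASE = "https://api.fatcat.wiki/v0"

def fetch_changelog(limit: int = 50) -> list:
    """Fetch recent changelog entries as JSON over HTTPS."""
    with urllib.request.urlopen(f"{API_BASE}/changelog?limit={limit}") as resp:
        return json.load(resp)

def new_entries(entries: list, last_index: int) -> list:
    """Keep only changelog entries newer than the last index already processed."""
    return [e for e in entries if e.get("index", 0) > last_index]
```

A mirror could persist `last_index` between polls to process the changelog exactly once, in order.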
Identifier Snapshots
Many of the other dump formats are very large. To save time and bandwidth, a few simple snapshot tables can be exported directly in TSV format. Because these tables can be dumped in single SQL transactions, they are consistent point-in-time snapshots.
One format is per-entity identifier/revision tables. These contain active, deleted, and redirected identifiers, with revision and redirect references, and are used to generate the entity dumps below.
Other tables contain external identifier mappings or file hashes.
Release abstracts can be dumped in their own table (JSON format), allowing them to be included only by reference from other dumps. The copyright status and usage restrictions on abstracts are different from other catalog content; see the metadata licensing section for more context. Abstracts are immutable and referenced by hash in the database, so the consistency of these dumps is not as much of a concern as with other exports.
Unlike all other dumps and public formats, the Fatcat identifiers in these dumps are in raw UUID format (not base32-encoded), though this may be fixed in the future.
See `./extra/sql_dumps/` for scripts and details. Dumps are on the order of a couple GBytes each (compressed).
Entity Exports
Using the above identifier snapshots, the Rust `fatcat-export` program outputs single-entity-per-line JSON files with the same schema as the HTTP API. These might contain the default fields, or be in "expanded" format containing sub-entities for each record.
Only "active" entities are included (not deleted, work-in-progress, or redirected entities).
These dumps can be quite large when expanded (over 100 GBytes compressed), but do not include history so will not grow as fast as other exports over time. Not all entity types are dumped at the moment; if you would like specific dumps get in touch!
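Consuming such a dump can be as simple as streaming one JSON object per line, sketched here in Python (the `ext_ids` field name is an assumption about the export schema; check the actual dump contents):

```python
import gzip
import json

def iter_releases(path: str):
    """Stream release entities from a gzipped JSON-lines dump, one per line."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

def count_dois(path: str) -> int:
    """Example aggregation: count dumped releases that carry a DOI.

    The "ext_ids" field name is an assumption, not a documented schema.
    """
    return sum(1 for rel in iter_releases(path)
               if rel.get("ext_ids", {}).get("doi"))
```

Streaming line-by-line keeps memory use constant even for the 100+ GByte expanded dumps.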
Fatcat Identifiers
Fatcat identifiers are semantically meaningless fixed-length random numbers, usually represented in case-insensitive base32 format. Each entity type has its own identifier namespace.
128-bit (UUID size) identifiers encode as 26 characters (but note that not all such strings decode to valid UUIDs), and in the backend can be serialized in UUID columns:
work_rzga5b9cd7efgh04iljk8f3jvz
https://scholar.archive.org/fatcat/work/rzga5b9cd7efgh04iljk8f3jvz
In comparison, 96-bit identifiers would have 20 characters and look like:
work_rzga5b9cd7efgh04iljk
https://scholar.archive.org/fatcat/work/rzga5b9cd7efgh04iljk
and 64-bit:
work_rzga5b9cd7efg
https://scholar.archive.org/fatcat/work/rzga5b9cd7efg
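A round-trip between UUIDs and the 26-character base32 form can be sketched as follows (this assumes the standard RFC 4648 base32 alphabet, lower-cased and unpadded; confirm against the actual implementation):

```python
import base64
import uuid

def uuid_to_fatcat_ident(u: uuid.UUID) -> str:
    """Encode a 128-bit UUID as a 26-character lower-case base32 string.

    Assumes the RFC 4648 alphabet; 16 bytes encode to 32 base32 characters,
    of which the last 6 are '=' padding and get stripped.
    """
    return base64.b32encode(u.bytes).decode("ascii").rstrip("=").lower()

def fatcat_ident_to_uuid(ident: str) -> uuid.UUID:
    """Decode a 26-character base32 identifier back to a UUID."""
    padded = ident.upper() + "=" * 6   # restore padding: 26 + 6 = 32 chars
    return uuid.UUID(bytes=base64.b32decode(padded))
```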
Fatcat identifiers can be used to interlink between databases, but are explicitly not intended to supplant DOIs, ISBNs, handles, ARKs, and other "registered" persistent identifiers for general use.
Internal Schema
Internally, identifiers are lightweight pointers to "revisions" of an entity. Revisions are stored in their complete form, not as a patch or difference; if comparing to distributed version control systems (for managing changes to source code), this follows the git model, not the mercurial model.
The entity revisions are immutable once accepted; the editing process involves the creation of new entity revisions and, if the edit is approved, pointing the identifier to the new revision. Entities cross-reference between themselves by identifier not revision number. Identifier pointers also support (versioned) deletion and redirects (for merging entities).
Edit objects represent a change to a single entity; edits get batched together into edit groups (like "commits" and "pull requests" in git parlance).
SQL tables look something like the following (with separate tables per entity type, a la `work_revision` and `work_edit`):
```
entity_ident
    id (uuid)
    current_revision (entity_revision foreign key)
    redirect_id (optional; points to another entity_ident)
    is_live (boolean; whether newly created entity has been accepted)

entity_revision
    revision_id
    <all entity-style-specific fields>
    extra: json blob for schema evolution

entity_edit
    timestamp
    editgroup_id (editgroup foreign key)
    ident (entity_ident foreign key)
    new_revision (entity_revision foreign key)
    new_redirect (optional; points to entity_ident table)
    previous_revision (optional; points to entity_revision)
    extra: json blob for provenance metadata

editgroup
    editor_id (editor table foreign key)
    description
    extra: json blob for provenance metadata
```
An individual entity can be in the following "states", from which the given actions (transitions) can be made:

- `wip` (not live; not redirect; has rev)
  - activate (to `active`)
- `active` (live; not redirect; has rev)
  - redirect (to `redirect`)
  - delete (to `deleted`)
- `redirect` (live; redirect; rev or not)
  - split (to `active`)
  - delete (to `deleted`)
- `deleted` (live; not redirect; no rev)
  - redirect (to `redirect`)
  - activate (to `active`)
"WIP, redirect" or "WIP, deleted" are invalid states.
Additional entity-specific columns hold actual metadata. Additional tables (which reference both `entity_revision` and `entity_ident` foreign keys as appropriate) represent things like authorship relationships (creator/release), citations between works, etc. Every revision of an entity requires duplicating all of these associated rows, which could end up being a large source of inefficiency, but is necessary to represent the full history of an object.
Sources
The core metadata bootstrap sources, by entity type, are:
- `releases`: Crossref metadata, with DOIs as the primary identifier, and PubMed (Central), Wikidata, and CORE identifiers cross-referenced
- `containers`: munged metadata from the DOAJ, ROAD, and the Norwegian journal list, with ISSN-Ls as the primary identifier. ISSN provides an "ISSN to ISSN-L" mapping to normalize electronic and print ISSN numbers.
- `creators`: ORCID metadata and identifiers.
Initial `file` metadata and matches (file-to-release) come from earlier Internet Archive matching efforts, and in particular efforts to extract bibliographic metadata from PDFs (using GROBID) and fuzzy match it (with conservative settings) to Crossref metadata.
The intent is to continuously ingest and merge metadata from a small number of large (~2-3 million or more records) general-purpose aggregators and catalogs in a centralized fashion, using bots, and then support volunteers and organizations in writing bots to merge high-quality metadata from field- or institution-specific catalogs.
Provenance information (where the metadata comes from, or who "makes specific claims") is stored in edit metadata in the data model. Value-level attribution can be achieved by looking at the full edit history for an entity as a series of patches.
Reference Graph (refcat)
In Summer 2021, the first version of a reference graph dataset, named "refcat", was released and integrated into the fatcat web interface. The dataset contains billions of references between papers in the fatcat catalog, as well as partial coverage of references from papers to books, to websites, and from Wikipedia articles to papers. This is a first step towards identifying links and references between scholarly works of all types preserved in archive.org.
The refcat dataset can be downloaded in JSON lines format from the archive.org "Fatcat Database Snapshots and Bulk Metadata Exports" collection, and is released under a CC-0 license for broad reuse. Acknowledgement and attribution for both the aggregated dataset and the original metadata sources is strongly encouraged (see below for provenance notes).
References can be browsed on fatcat on an "outbound" ("References") and "inbound" ("Cited By") basis for individual release entities. There are also special pages for Wikipedia articles ("outbound", such as Internet) and Open Library books ("inbound", such as The Gift). JSON versions of these pages are available, but do not yet represent a stable API.
How It Works
Raw reference data comes from multiple sources (see "provenance" below), but has the common structure of a "source" entity (which could be a paper, Wikipedia article, etc) and a list of raw references. There might be duplicate references for a single "source" work coming from different providers (eg, both Pubmed and Crossref reference lists). The goal is to match as many references as possible to the "target" work being referenced, creating a link from source to target. If a robust match is not found, the "unmatched" reference is retained and displayed in a human readable fashion if possible.
Depending on the source, raw references may be a simple "raw" string in an arbitrary citation style; may have been parsed or structured in fields like "title", "year", "volume", "issue"; might include a URL or identifier like an arxiv.org identifier; or may have already been matched to a specific target work by another party. It is also possible the reference is vague, malformed, mis-parsed, or not even a reference to a specific work (eg, "personal communication"). Based on the available structure, we might be able to do a simple identifier lookup, or may need to parse a string, or do "fuzzy" matching against various catalogs of known works. As a final step we take all original and potential matches, verify the matches, and attempt to de-duplicate references coming from different providers into a list of matched and unmatched references as output. The refcat corpus is the output of this process.
Two dominant modes of reference matching are employed: identifier-based matching and fuzzy matching. Identifier-based matching currently works with DOIs, arxiv ids, PMIDs, PMCIDs, and ISBNs. Fuzzy matching employs a scalable way to cluster documents (with pluggable clustering algorithms). For each cluster of match candidates we run a more extensive verification process, which yields a match confidence category ranging from weak, through strong, to exact. Strong and exact matches are included in the graph.
All the code for this process is available open source:
- refcat: batch processing and matching pipeline, in Python and Go
- fuzzycat: Python verification code and "live" fuzzy matching
Metadata Provenance
The provenance for each reference in the index is tracked and exposed via the `match_provenance` field. A `fatcat-` prefix on the field value means that the reference came through the `refs` metadata field stored in the fatcat catalog, but originally came from the indicated source. In the absence of a `fatcat-` prefix, the reference was found, updated, or extracted at indexing time and is not recorded in the `release` entity metadata.
Specific sources:
- `crossref` (and `fatcat-crossref`): citations deposited by publishers as part of DOI registration. Crossref is the largest single source of citation metadata in refcat. These references may be linked to a specific DOI; contain structured metadata fields; or be in the form of a raw citation string. Sometimes they are "complete" for the given work, and sometimes they only include references which could be matched/linked to a target work with a DOI.
- `fatcat-datacite`: same as `crossref`, but for the Datacite DOI registrar.
- `fatcat-pubmed`: references, linked or not, from Pubmed/MEDLINE metadata.
- `fatcat`: references in fatcat where the original provenance can't be inferred (but could be manually found by inspecting the release edit history).
- `grobid`: references parsed out of full-text PDFs using GROBID.
- `wikipedia`: citations extracted from Wikipedia (see below for details).
Note that sources of reference metadata which have formal licensing restrictions, even CC-BY or ODC-BY licenses as used by several similar datasets, are not included in refcat.
Current Limitations and Known Issues
The initial Summer 2021 version of the index has a number of limitations. Feedback on features and coverage are welcome! We expect this dataset to be iterated over regularly as there are a few dimensions along which the dataset can be improved and extended.
The reference matching process is designed to eventually operate in both "batch" and "live" modes, but currently only "batch" output is in the index. This means that references from newly published papers are not added to the index in an ongoing fashion.
Fatcat "release" entities (eg, papers) are matched from a Spring 2021 snapshot. References to papers published after this time will not be linked.
Wikipedia citations come from the dataset Wikipedia Citations: A comprehensive dataset of citations with identifiers extracted from English Wikipedia, by Singh, West, and Colavizza. This is a one-time corpus based on a May 2020 snapshot of English Wikipedia only, and is missing many current references and citations. Additionally, only direct identifier lookups (eg, DOI matches) are used, not fuzzy metadata matching.
Open Library "target" matches are based on a snapshot of Open Library works, and are matched either by ISBN (extracted from the citation string) or by fuzzy metadata matching.
Crossref references are extracted from a January 2021 snapshot of Crossref metadata, and do not include many updates to existing works.
Hundreds of millions of raw citation strings ("unstructured") have not been parsed into a structured form for fuzzy matching. We plan to use GROBID to parse these citation strings, in addition to the current use of GROBID for parsing references from fulltext documents.
The current GROBID parsing used version v0.6.0. Newer versions of GROBID have improved citation parsing accuracy, and we intend to re-parse all PDFs over time. Additional manually-tagged training datasets could improve GROBID performance even further.
In a future update, we intend to add Wayback (web archive) capture status and access links for references to websites (distinct from references to online journal articles or books). For example, references to an online news article or blog post would indicate the closest (in time, to the "source" publication date) Wayback captures to that web page, if available.
References are only displayed on fatcat, not yet on scholar.archive.org.
There is no current or planned mechanism for searching, sorting, or filtering article search results by (inbound) citation count. This would require resource-intensive transformations and continuous re-indexing of search indexes.
It is unclear how the batch-generated refcat dataset and API-editable release refs metadata will interact in the future. The original refs may eventually be dropped from the fatcat API, or at some point the refcat corpus may stabilize and be imported in to fatcat refs instead of being maintained as a separate dataset and index. It would be good to retain a mechanism for human corrections and overrides to the machine-generated reference graph.
Metadata Licensing
The Fatcat catalog content license is the Creative Commons Zero ("CC-0") license, which is effectively a public domain grant. This applies to the catalog metadata itself (titles, entity relationships, citation metadata, URLs, hashes, identifiers), as well as "meta-meta-data" provided by editors (edit descriptions, provenance metadata, etc).
The core catalog is designed to contain only factual information: "this work, known by this title and with these third-party identifiers, is believed to be represented by these files and published under such-and-such venue". As a norm, sourcing metadata (for attribution and provenance) is retained for each edit made to the catalog.
A notable exception to this policy are abstracts, for which no copyright claims or license is made. Abstract content is kept separate from core catalog metadata; downstream users need to make their own decision regarding reuse and distribution of this material.
As a social norm, it is expected (and appreciated!) that downstream users of the public API and/or bulk exports provide attribution, and even transitive attribution (acknowledging the original source of metadata contributed to Fatcat). As an academic norm, researchers are encouraged to cite the corpus as a dataset (when this option becomes available). However, neither of these norms are enforced via the copyright mechanism.
For Publishers
This page addresses common questions and concerns from publishers of research works indexed in Fatcat, as well as the Internet Archive Scholar service built on top of it.
For help in exceptional cases, contact Internet Archive through our usual support channels.
Metadata Indexing
Many publishers will find that metadata records are already included in fatcat if they register persistent identifiers for their research works. This pipeline is based on our automated harvesting of DOI, Pubmed, dblp, DOAJ, and other metadata catalogs. This process can take some time (eg, days from registration), does not (yet) cover all persistent identifiers, and will only cover those works which get identifiers.
For publishers who find that they are not getting indexed in fatcat, our primary advice is to register ISSNs for venues (journals, repositories, conferences, etc), and to register DOIs for all current and back-catalog works. DOIs are the most common and integrated identifier in the scholarly ecosystem, and will result in automatic indexing in many other aggregators in addition to fatcat/scholar. There may be funding or resources available for smaller publishers to cover the cost of DOI registration, and ISSN registration is usually no-cost or affordable through national institutions.
We do not recommend that journal or conference publishers use general-purpose repositories like Zenodo to obtain no-cost DOIs for journal articles. These platforms are a great place for pre-publication versions, datasets, software, and other artifacts, but not for primary publication-version works (in our opinion).
If DOI registration is not possible, one good alternative is to get included in the Directory of Open Access Journals and deposit article metadata there. This process may take some time, but is a good basic indicator of publication quality. DOAJ article metadata is periodically harvested and indexed in fatcat, after a de-duplication process.
Improving Automatic Preservation
In alignment with its mission, Internet Archive makes basic automated attempts to capture and preserve all open access research publications on the public web, at no cost. This effort comes with no guarantees around completeness, timeliness, or support communications.
Preservation coverage can be monitored through the journal-specific dashboards or via the coverage search interface.
There are a few technical things publishers can do to increase their preservation coverage, in addition to the metadata indexing tips above:
- use the `citation_pdf_url` HTML meta tag, when appropriate, to link directly from article landing pages to PDF URLs
- use simple HTML to represent landing pages and article content, and do not require Javascript to render page content or links
- ensure that hosting server `robots.txt` rules are not preventing or overly restricting automated crawling
- use simple, accessible PDF access links: do not use time-limited or IP-limited URLs, require specific referrer headers, or use cookies to authenticate access to OA PDFs
- minimize the number of HTTP redirects and HTML hops between DOI and fulltext content
- note that paywalls, loginwalls, geofencing, and anti-bot measures are all obviously antithetical to open crawling and indexing
Official Preservation
Internet Archive is developing preservation services for scholarly content on the web. Contact us at scholar@archive.org for details.
Existing web archiving services offered to universities, national libraries, and other institutions may already be appropriate for some publications. Check if your affiliated institutions already have an Archive-IT account or other existing relationship with Internet Archive.
Small publishers using Open Journal System (OJS) should be aware of the PKP preservation project.
Presentations
2020 Workshop On Open Citations And Open Scholarly Metadata 2020 - Fatcat (video on archive.org)
2019-10-25 FORCE2019 - Perpetual Access Machines: Archiving Web-Published Scholarship at Scale (video on youtube.com)
Blog Posts And Press
2021-03-09: blog.archive.org - Search Scholarly Materials Preserved in the Internet Archive
2020-09-17 blog.dshr.org - Don't Say We Didn't Warn You
2020-09-15: blog.archive.org - How the Internet Archive is Ensuring Permanent Access to Open Access Journal Articles
2020-02-18 blog.dshr.org - The Scholarly Record At The Internet Archive
2019-04-18 blog.dshr.org - Personal Pods and Fatcat
2018-10-03 blog.dshr.org - Brief Talk At Internet Archive Event
2018-03-05 blog.archive.org - Andrew W. Mellon Foundation Awards Grant to the Internet Archive for Long Tail Journal Preservation