Open Library provides dumps of all its data, generated every month. Most of the data dumps are formatted as tab-separated files with the following columns:
- type - type of record (/type/edition, /type/work etc.)
- key - unique key of the record (/books/OL1M etc.)
- revision - revision number of the record
- last_modified - last modified timestamp
- JSON - the complete record in JSON format
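Given that format, a dump line can be parsed with a few lines of Python. This is a minimal sketch (the sample line is made up for illustration); it splits on the first four tabs so that tabs inside the JSON column, if any, are preserved:

```python
import json

def parse_dump_line(line):
    """Split one dump line into its five tab-separated columns
    and decode the JSON record in the last column."""
    type_, key, revision, last_modified, record_json = line.rstrip("\n").split("\t", 4)
    return {
        "type": type_,
        "key": key,
        "revision": int(revision),
        "last_modified": last_modified,
        "record": json.loads(record_json),
    }

# Illustrative sample line (not taken from a real dump):
sample = '/type/edition\t/books/OL1M\t3\t2010-04-14T00:00:00\t{"title": "Example"}'
row = parse_dump_line(sample)
print(row["key"], row["record"]["title"])  # /books/OL1M Example
```

The dump files are gzip-compressed, so in practice you would iterate over `gzip.open(path, "rt")` rather than a plain file.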
Dumps
- editions dump (~ 9.2G)
- works dump (~ 2.9G)
- authors dump (~ 0.5G)
- all types dump (~ 12.4G): includes editions, works, authors, redirects, etc.
- complete dump (~ 29.6G): also includes past revisions of all the records in Open Library
- ratings dump (~ 5M): with columns "Work Key, Edition Key (optional), Rating, Date"
- reading log dump (~ 65M): with columns "Work Key, Edition Key (optional), Shelf, Date"
- redirects dump (~ 50M)
- deletes dump (~ 75M)
- lists dump (~ 30M)
- other dump (~ 10M)
- covers metadata dump (~ 70M): with columns "id, width, height, created"
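The smaller column-based dumps (ratings, reading log, covers metadata) are easy to process directly. As a sketch, here is how you might tally the distribution of star ratings from the ratings dump using only the standard library; the file path is a placeholder for whichever dump file you downloaded:

```python
import csv
from collections import Counter

def rating_histogram(path):
    """Count how many times each star rating appears in a ratings dump.

    The ratings dump is tab-separated with the columns:
    Work Key, Edition Key (optional), Rating, Date.
    """
    counts = Counter()
    with open(path, newline="") as f:
        for work_key, edition_key, rating, date in csv.reader(f, delimiter="\t"):
            counts[int(rating)] += 1
    return counts

# Usage (placeholder filename):
# print(rating_histogram("ol_dump_ratings_2024-07-31.txt"))
```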
For past dumps, see: https://archive.org/details/ol_exports?sort=-publicdate
Downloading the dumps taking too long? Check out the link above and download via torrent for higher speeds!
Format of JSON records
A JSON schema for the various types is located at https://github.com/internetarchive/openlibrary-client/tree/master/olclient/schemata
- Author Records: JSON serialization of a type/author
- Edition Records: JSON serialization of a type/edition
- Work Records: JSON serialization of a type/work
Using Open Library Data Dumps
This guide by a contributor on the LibrariesHacked GitHub explains how to load Open Library's data dumps into PostgreSQL to make them more easily queryable:
https://github.com/LibrariesHacked/openlibrary-search
DuckDB
DuckDB is another easy tool for querying the dumps without much setup. For example, to get all the Wikidata IDs currently in the authors dump:
```sql
SELECT json_extract(column4, '$.remote_ids.wikidata') AS wikidata_id
FROM read_csv('ol_dump_authors_2024-07-31.txt.gz')
WHERE wikidata_id IS NOT NULL
LIMIT 100;
```
GraphQL
DiFronzo on GitHub has produced a GraphQL proxy to search books using work, edition and ISBN with the Open Library API. Deployed with Deno and GraphQL:
https://github.com/DiFronzo/OpenLibrary-GraphQL
OL Covers Dump
We do not yet have rolling monthly dumps of our book covers, despite a shared desire for their existence. Some historical cover dumps may be explored here:
https://archive.org/details/ol_data?tab=collection&query=identifier%3Acovers&sort=-addeddate
Most covers are archived in the following items. Note covers_0006 and covers_0007 are presently unavailable.
- https://archive.org/details/covers_0000
- https://archive.org/details/covers_0001
- https://archive.org/details/covers_0002
- https://archive.org/details/covers_0003
- https://archive.org/details/covers_0004
- https://archive.org/details/covers_0005
- https://archive.org/details/covers_0008
- https://archive.org/details/covers_0009
- https://archive.org/details/covers_0010
- https://archive.org/details/covers_0011
- https://archive.org/details/covers_0012
- https://archive.org/details/covers_0013
- https://archive.org/details/covers_0014
History
- Created December 14, 2011
- 36 revisions
- January 10, 2025 | Edited by raybb | fix typos
- January 4, 2025 | Edited by raybb | add DuckDB note
- August 7, 2024 | Edited by Drini | Fix dump sizes / instructions
- August 7, 2024 | Edited by Drini | New dumps are now available!
- December 14, 2011 | Created by Anand Chitipothu | Documented Open Library Data Dumps