Open Library provides dumps of all its data, generated every month. Most of the data dumps are formatted as tab-separated files with the following columns:
- type - type of record (/type/edition, /type/work etc.)
- key - unique key of the record (/books/OL1M etc.)
- revision - revision number of the record
- last_modified - last modified timestamp
- JSON - the complete record in JSON format
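Given that format, a dump line can be parsed with a few lines of Python. This is a minimal sketch (the sample line is made up for illustration); it splits on the first four tabs so that tabs inside the JSON column, if any, are preserved:

```python
import json

def parse_dump_line(line):
    """Split one dump line into its five tab-separated columns
    and decode the JSON record in the last column."""
    type_, key, revision, last_modified, record_json = line.rstrip("\n").split("\t", 4)
    return {
        "type": type_,
        "key": key,
        "revision": int(revision),
        "last_modified": last_modified,
        "record": json.loads(record_json),
    }

# Illustrative sample line (not taken from a real dump):
sample = '/type/edition\t/books/OL1M\t3\t2010-04-14T00:00:00\t{"title": "Example"}'
row = parse_dump_line(sample)
print(row["key"], row["record"]["title"])  # /books/OL1M Example
```

The dump files are gzip-compressed, so in practice you would iterate over `gzip.open(path, "rt")` rather than a plain file.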
Dumps
- editions dump (~ 9.2G)
- works dump (~ 2.9G)
- authors dump (~ 0.5G)
- all types dump (~ 12.4G): includes editions, works, authors, redirects, etc.
- complete dump (~ 29.6G): also includes past revisions of all the records in Open Library
- ratings dump (~ 5M): with columns "Work Key, Edition Key (optional), Rating, Date"
- reading log dump (~ 65M): with columns "Work Key, Edition Key (optional), Shelf, Date"
- redirects dump (~ 50M)
- deletes dump (~ 75M)
- lists dump (~ 30M)
- other dump (~ 10M)
- covers metadata dump (~ 70M): with columns "id, width, height, created"
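The smaller column-based dumps (ratings, reading log, covers metadata) are easy to process directly. As a sketch, here is how you might tally the distribution of star ratings from the ratings dump using only the standard library; the file path is a placeholder for whichever dump file you downloaded:

```python
import csv
from collections import Counter

def rating_histogram(path):
    """Count how many times each star rating appears in a ratings dump.

    The ratings dump is tab-separated with the columns:
    Work Key, Edition Key (optional), Rating, Date.
    """
    counts = Counter()
    with open(path, newline="") as f:
        for work_key, edition_key, rating, date in csv.reader(f, delimiter="\t"):
            counts[int(rating)] += 1
    return counts

# Usage (placeholder filename):
# print(rating_histogram("ol_dump_ratings_2024-07-31.txt"))
```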
For past dumps, see: https://archive.org/details/ol_exports?sort=-publicdate
Downloading the dumps taking too long? Check out the link above and download via torrent for higher speeds!
Format of JSON records
A JSON schema for the various types is located at https://github.com/internetarchive/openlibrary-client/tree/master/olclient/schemata
- Author Records: JSON serialization of a type/author
- Edition Records: JSON serialization of a type/edition
- Work Records: JSON serialization of a type/work
Using Open Library Data Dumps
This guide by a contributor on the LibrariesHacked GitHub explains how to load Open Library's data dumps into PostgreSQL to make them more easily queryable:
https://github.com/LibrariesHacked/openlibrary-search
DuckDB
DuckDB is another easy tool for querying the dumps without much setup. For example, to get all the Wikidata IDs currently in the authors dump:
```sql
SELECT json_extract(column4, '$.remote_ids.wikidata') AS wikidata_id
FROM read_csv('ol_dump_authors_2024-07-31.txt.gz')
WHERE wikidata_id IS NOT NULL
LIMIT 100;
```
GraphQL
DiFronzo on GitHub has produced a GraphQL proxy to search books using work, edition and ISBN with the Open Library API. Deployed with Deno and GraphQL:
https://github.com/DiFronzo/OpenLibrary-GraphQL
OL Covers Dump
We do not yet have rolling monthly dumps of our book covers, despite a shared desire for their existence. Some historical cover dumps may be explored here:
https://archive.org/details/ol_data?tab=collection&query=identifier%3Acovers&sort=-addeddate
Most covers are archived in the following items. Note covers_0006 and covers_0007 are presently unavailable.
- https://archive.org/details/covers_0000
- https://archive.org/details/covers_0001
- https://archive.org/details/covers_0002
- https://archive.org/details/covers_0003
- https://archive.org/details/covers_0004
- https://archive.org/details/covers_0005
- https://archive.org/details/covers_0008
- https://archive.org/details/covers_0009
- https://archive.org/details/covers_0010
- https://archive.org/details/covers_0011
- https://archive.org/details/covers_0012
- https://archive.org/details/covers_0013
- https://archive.org/details/covers_0014
History
- Created December 14, 2011
- 36 revisions
- January 10, 2025 | Edited by raybb | fix typos
- January 4, 2025 | Edited by raybb | add DuckDB note
- August 7, 2024 | Edited by Drini | Fix dump sizes / instructions
- August 7, 2024 | Edited by Drini | New dumps are now available!
- December 14, 2011 | Created by Anand Chitipothu | Documented Open Library Data Dumps