Index
- Webpy: Python microframework
- Infogami: wiki batteries for webpy
- Macros: embed functions in markdown
- Infobase: Infogami's db ORM
- Infostore: multi-db support for Infobase
- Coverthing: an open repository of book covers
- Solr
- Production Architecture (Legacy Architecture & Legacy Hardware)
- Partner Integrations
- Open Library Source code
- Performance Issues
Overview
Open Library is powered by Infogami, a wiki application framework built on web.py. Unlike other wikis, Infogami has the flexibility to handle different classes of data, including structured data. That makes it the perfect platform for Open Library.
Open Library also uses Markdown, a text-to-HTML formatting language created by John Gruber, along with the handy WMD WYSIWYG editor for Markdown.
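As a rough illustration of what Markdown does (Open Library's actual Markdown implementation is bundled inside Infogami; the standalone python-markdown package here is just a stand-in):
# Illustration only: Open Library's Markdown support lives inside Infogami;
# the standalone python-markdown package is used here as a stand-in.
import markdown

html = markdown.markdown("**Open Library** is a *wiki* of books.")
print(html)  # <p><strong>Open Library</strong> is a <em>wiki</em> of books.</p>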
Original Architecture (2007)
Web server: the lighttpd http server runs Infogami through the FastCGI interface using Flup. (There can be multiple concurrent Infogami instances that the lighttpd server distributes requests between, although we currently run just one.) Infogami is written in Python (we currently require 2.5 or greater) and uses web.py and ThingDB. ThingDB uses PostgreSQL as its data store; Psycopg2 is the Python driver for PostgreSQL. We use supervise (see also daemontools) to keep everything running.
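A minimal sketch of how that serving layer fits together, assuming web.py and Flup as named above (the URL mapping and app contents are hypothetical):
# A sketch of serving a web.py application over FastCGI with Flup, as
# described above. The URL mapping and handler are hypothetical.
import web
from flup.server.fcgi import WSGIServer

urls = ("/(.*)", "index")

class index:
    def GET(self, path):
        return "hello from an infogami-style app"

app = web.application(urls, globals())

if __name__ == "__main__":
    # lighttpd connects to this FastCGI socket and proxies requests to it.
    WSGIServer(app.wsgifunc(), bindAddress=("127.0.0.1", 9080)).run()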
Templates: The infogami application relies on various Web templates (these are code+html snippets). The initial templates are static files but they get edited through the wiki interface, and new ones get added through the wiki, so the real versions live entirely in the database.
Search: Infogami also accepts plug-ins, and we use one for the Solr search engine. Solr is a Java web application currently running in a Jetty http server, so it communicates with Infogami over a local http socket. Solr itself wraps the Lucene search library. These run under Java (we're currently using Java 1.5, I think). Solr is built with Apache Ant and has a few config and schema files, plus a startup script (solr.sh) that has to be manually edited to set the port number. I think we currently use Lucene as a downloaded .jar file, so we don't build it.
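Since Solr is reached over that local http socket, a plugin-side query is just an HTTP GET. A rough sketch (the host, port, and field name are assumptions, not taken from our config):
# A rough sketch of querying Solr over its local HTTP interface, as the
# search plugin does. Host, port, and field name are assumptions.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

params = urlencode({"q": "title:heartbreaking", "wt": "json"})
with urlopen("http://localhost:8983/solr/select?" + params) as resp:
    results = json.load(resp)

for doc in results["response"]["docs"]:
    print(doc)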
Search plugin: The solr-infogami plugin also calls out to an archive.org PHP script that expands basic search queries to advanced queries. It may also start using the openlibrary.org flipbook (with some possible customizations) to display OCA scans for pages containing fulltext search results.
Data: We have a bunch of catalog data and fulltext acquired from various sources, either sitting in the Archive or waiting to be uploaded there. The acquisition processes (including web crawling scripts for some of the data) are, I think, outside the scope of an Open Library software install. There are a bunch of additional scripts that make this material usable in Open Library, and these need to be documented. They include TDB Conversion Scripts written by dbg, and (for OCA fulltext) Archive Spidering and Solr Importing scripts written by phr.
Infobase
We created Infobase, a new database framework that gives us this flexibility. Infobase stores a collection of objects, called "things". For example, on the Open Library site, each page, book, author, and user is a thing in the database. Each thing then has a series of arbitrary key-value pairs as properties. For example, a book thing may have the key "title" with the value "A Heartbreaking Work of Staggering Genius" and the key "genre" with the value "Memoir". Each collection of key-value pairs is stored as a version, along with the time it was saved and the person who saved it. This allows us to store fully structured data, as well as travel back through time to retrieve old versions of it.
Infobase is built on top of PostgreSQL, but its interface is abstract enough to allow it to be moved to other backends as performance requires. The current schema of Infobase tables looks like:
TABLE site
    id
    name (string)

TABLE thing
    id
    site_id (references site)
    key (string)
    [(site_id, key) combinations are unique]

TABLE version
    id
    revision (int)
    thing_id (references thing)
    author_id (references thing)
    ip (ip address)
    comment (string)
    created (datetime)
    [(thing_id, revision) combinations are unique]

TABLE datum
    thing_id (references thing)
    begin_revision (int)
    end_revision (int)
    key (string)
    value (string)
    datatype ('string', 'reference', 'int', 'float', or 'date')
    ordering (integer, default null)
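A minimal sketch of that schema as PostgreSQL DDL, issued through Psycopg2 (named above as our driver). The column types are inferred from the annotations in the listing and may differ from the production definitions:
# A sketch of the Infobase schema above as PostgreSQL DDL via psycopg2.
# Column types are inferred from the annotations and may not match production.
import psycopg2

conn = psycopg2.connect("dbname=infobase")  # connection string is an assumption
cur = conn.cursor()
cur.execute("""
CREATE TABLE site (
    id      serial PRIMARY KEY,
    name    text
);
CREATE TABLE thing (
    id      serial PRIMARY KEY,
    site_id integer REFERENCES site,
    key     text,
    UNIQUE (site_id, key)
);
CREATE TABLE version (
    id        serial PRIMARY KEY,
    revision  integer,
    thing_id  integer REFERENCES thing,
    author_id integer REFERENCES thing,
    ip        inet,
    comment   text,
    created   timestamp,
    UNIQUE (thing_id, revision)
);
CREATE TABLE datum (
    thing_id       integer REFERENCES thing,
    begin_revision integer,
    end_revision   integer,
    key            text,
    value          text,
    datatype       text,  -- 'string', 'reference', 'int', 'float', or 'date'
    ordering       integer DEFAULT NULL
);
""")
conn.commit()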
From Python, the Infobase interface looks like this:
# retrieve the book object
foo = site.get('/foo')
assert foo.title == "The Story of Foo"
# query for books by that author
foos = site.things(dict(author="Joe Jacobson"))
assert foos[0].title == "The Story of Foo"
Infobase also has a programmable API, which can be used to build applications using the Open Library data.
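For example, Open Library pages can be fetched as JSON by appending .json to their URL (see openlibrary.org/developers/api for the current documentation). A quick sketch:
# Fetching an Open Library record as JSON by appending .json to its URL.
# OL1M is an example edition key; any valid key works the same way.
import json
from urllib.request import urlopen

with urlopen("https://openlibrary.org/books/OL1M.json") as resp:
    edition = json.load(resp)

print(edition["key"])    # e.g. "/books/OL1M"
print(edition["title"])  # the edition's title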
Production Architecture (Legacy Architecture & Legacy Hardware)
Note: This data may be quite old; please check our GitHub Dockerfiles for the latest.
Web server: the nginx http server (formerly lighttpd) runs Infogami through gunicorn (formerly the FastCGI interface using Flup). (There can be multiple concurrent Infogami instances that the front-end server distributes requests between, although we currently run just one.) Infogami is written in Python and uses web.py and ThingDB. ThingDB uses PostgreSQL as its data store; Psycopg2 is the Python driver for PostgreSQL. We use supervise (see also daemontools) to keep everything running. The templates, search setup, and data pipeline remain as described under Original Architecture above.
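A sketch of the gunicorn side of that setup, reusing the tiny web.py app from the Flup sketch above (module and app names are hypothetical; gunicorn just needs a WSGI callable):
# wsgi.py -- a hypothetical gunicorn entry point; compare the Flup sketch above.
# gunicorn needs only a WSGI callable, here named "application".
import web

urls = ("/(.*)", "index")

class index:
    def GET(self, path):
        return "hello from an infogami-style app"

application = web.application(urls, globals()).wsgifunc()

# nginx proxies to gunicorn, started with something like:
#   gunicorn --workers 4 --bind 127.0.0.1:8080 wsgi:application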
Infogami
Simply building a new database wasn't enough; we needed a new wiki to take advantage of it. So we built Infogami. Infogami is a cleaner, simpler wiki, but unlike other wikis it has the flexibility to handle different classes of data. Most wikis only let you store unstructured pages -- big blocks of text. Infogami lets you store structured data, just like Infobase does, and use Infobase's query powers to sort through it.
Each infogami page (i.e. something with a URL) has an associated type. Each type contains a schema that states what fields can be used with it and what format those fields are in. Those are used to generate view and edit templates which can then be further customized as a particular type requires.
The result, as you can see on the Open Library site, is that one wiki contains pages that represent books, pages that represent authors, and pages that are simply wiki pages, each with their own distinct look and edit templates and set of data.
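To make that concrete, here is a rough sketch of what two typed Infogami pages look like as data (the field names below are illustrative, not a definitive schema):
# A rough sketch of typed Infogami pages as key-value data.
# Field names are illustrative, not a definitive schema.
author_page = {
    "key": "/authors/OL1A",            # the page's URL
    "type": {"key": "/type/author"},   # its type, itself a wiki page
    "name": "Joe Jacobson",
    "birth_date": "1950",
}

book_page = {
    "key": "/books/OL1M",
    "type": {"key": "/type/edition"},
    "title": "The Story of Foo",
    "authors": [{"key": "/authors/OL1A"}],  # a reference to another thing
}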
Open Library Extensions
Infogami is also open to expansion. It has a rich plugin framework that lets us build exciting site-specific features on top of it. We've used it to add Open Library-specific technology for things like the search engine, and we hope to develop plugins to handle reviews, price checking, and other features important to the site.
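A minimal sketch of that plugin pattern, assuming Infogami's delegate module (the class name and URL path here are hypothetical):
# A minimal sketch of an Infogami plugin page, assuming the delegate module.
# The class name and URL path are hypothetical.
from infogami.utils import delegate

class hello(delegate.page):
    path = "/hello"

    def GET(self):
        # Plugins can also render wiki templates; a plain string keeps
        # this sketch self-contained.
        return "Hello from a plugin!"

# Placing this module in an enabled plugins directory registers /hello
# with the running Infogami application.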
Partner Tools & Integrations
Open Library gratefully uses BrowserStack for cross-browser compatibility testing, GitHub for hosting our public code repository, and GitHub Actions for continuous integration.