Some queries are taking more than a minute to execute. This document discusses the issues and their causes, along with some ideas to rectify them.
Access Pattern
The data access pattern can be divided into three groups:
- Read all properties of an object at the latest or a previous revision
- Find keys of all objects matching a particular query
- Find all versions matching a particular query
Schema
The database schema has 3 tables: thing, version and datum.
    thing:   id, key, latest_revision, last_modified, created
    version: thing_id, revision, created, comment, ip, machine_comment
    datum:   thing_id, begin_revision, end_revision, key, value, datatype, ordering
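As a rough local sketch of this layout (SQLite column types are assumptions here; the production database is PostgreSQL), the three tables can be reproduced like this:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE thing (
    id INTEGER PRIMARY KEY,
    key TEXT,
    latest_revision INTEGER,
    last_modified TEXT,
    created TEXT
);
CREATE TABLE version (
    thing_id INTEGER REFERENCES thing(id),
    revision INTEGER,
    created TEXT,
    comment TEXT,
    ip TEXT,
    machine_comment TEXT
);
CREATE TABLE datum (
    thing_id INTEGER REFERENCES thing(id),
    begin_revision INTEGER,
    end_revision INTEGER,   -- 2147483647 marks the live (latest) row
    key TEXT,               -- property name, e.g. 'authors'
    value TEXT,             -- stored as text, cast at query time
    datatype INTEGER,       -- e.g. 0 = reference, 2 = string
    ordering INTEGER        -- position for multi-valued properties
);
""")
names = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(names)  # ['datum', 'thing', 'version']
```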
Performance Issues
Let's look at an example.
Here is the query plan for the things query {"type": "/type/edition", "authors": "/a/OL1458403A", "limit": "20"}:
    infobase_production=# explain
    SELECT thing.key
    FROM thing, datum AS d0, datum AS d1
    WHERE thing.site_id = 1
      AND d0.thing_id = thing.id AND d0.end_revision = 2147483647
      AND d0.key = 'authors' AND cast(d0.value AS int) = 4955588 AND d0.datatype = 0
      AND d1.thing_id = thing.id AND d1.end_revision = 2147483647
      AND d1.key = 'type' AND cast(d1.value AS int) = 52 AND d1.datatype = 0
    LIMIT 20 OFFSET 0;

                                  QUERY PLAN
    -----------------------------------------------------------------------
     Limit  (cost=58668.07..59314.62 rows=7 width=17)
       ->  Nested Loop  (cost=58668.07..59314.62 rows=7 width=17)
             ->  Merge Join  (cost=58668.07..58777.90 rows=53 width=8)
                   Merge Cond: (d0.thing_id = d1.thing_id)
                   ->  Sort  (cost=24322.35..24344.94 rows=9035 width=4)
                         Sort Key: d0.thing_id
                         ->  Bitmap Heap Scan on datum d0  (cost=1673.50..23728.70 rows=9035 width=4)
                               Recheck Cond: (("key" = 'authors'::text) AND ((value)::integer = 4955588) AND (datatype = 0) AND (end_revision = 2147483647))
                               ->  Bitmap Index Scan on datum_key_value_ref_idx  (cost=0.00..1671.24 rows=9035 width=0)
                                     Index Cond: (("key" = 'authors'::text) AND ((value)::integer = 4955588))
                   ->  Sort  (cost=34345.71..34377.78 rows=12825 width=4)
                         Sort Key: d1.thing_id
                         ->  Bitmap Heap Scan on datum d1  (cost=2371.88..33470.62 rows=12825 width=4)
                               Recheck Cond: (("key" = 'type'::text) AND ((value)::integer = 52) AND (datatype = 0) AND (end_revision = 2147483647))
                               ->  Bitmap Index Scan on datum_key_value_ref_idx  (cost=0.00..2368.67 rows=12825 width=0)
                                     Index Cond: (("key" = 'type'::text) AND ((value)::integer = 52))
             ->  Index Scan using thing_pkey on thing  (cost=0.00..10.11 rows=1 width=21)
                   Index Cond: (d0.thing_id = thing.id)
                   Filter: (site_id = 1)
For executing this query, more than 2 million disk blocks are read.
    infobase_production=# select * from pg_statio_user_tables;
     relid  | schemaname | relname | heap_blks_read | heap_blks_hit | idx_blks_read | idx_blks_hit | toast_blks_read | toast_blks_hit | tidx_blks_read | tidx_blks_hit
    --------+------------+---------+----------------+---------------+---------------+--------------+-----------------+----------------+----------------+---------------
     304963 | public     | datum   |        2247162 |           657 |         31836 |        28822 |               0 |              0 |              0 |             0
     304958 | public     | account |              0 |             0 |               |              |               0 |              0 |              0 |             0
     304987 | public     | version |              0 |             0 |             0 |            0 |               0 |              0 |              0 |             0
     304977 | public     | thing   |              4 |             0 |            11 |            6 |               0 |              0 |              0 |             0
     304970 | public     | site    |              0 |             0 |             0 |            0 |               0 |              0 |              0 |             0
    (5 rows)
To see these statistics, the parameter stats_block_level must be enabled before executing the query.
    infobase_production=# select set_config('stats_block_level', 'on', false);
     set_config
    ------------
     on
    (1 row)
See http://www.postgresql.org/docs/8.2/static/monitoring-stats.html ("Monitoring Database Activity") to learn more.
When PostgreSQL executes the Recheck Cond step, it reads the actual heap blocks and re-checks the condition. Since rows for all properties are stored together in datum, the required rows are scattered all over the table, which means reading many more disk blocks. In the worst case, no two rows of the required property share a disk block.
If all rows of a property were stored together, far fewer disk blocks would need to be read.
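A toy model makes the effect concrete. Assuming, purely for illustration, 100 rows per disk block and 1,000 matching rows in a million-row table, counting the distinct blocks touched shows the gap between scattered and clustered storage:

```python
# Toy model of disk blocks read (numbers are illustrative assumptions,
# not measurements from the production database).
ROWS_PER_BLOCK = 100
TOTAL_ROWS = 1_000_000
MATCHING = 1_000

# Worst case: matching rows spread evenly across the whole heap.
scattered_rows = range(0, TOTAL_ROWS, TOTAL_ROWS // MATCHING)
scattered_blocks = len({row // ROWS_PER_BLOCK for row in scattered_rows})

# Best case: matching rows stored contiguously.
clustered_rows = range(MATCHING)
clustered_blocks = len({row // ROWS_PER_BLOCK for row in clustered_rows})

print(scattered_blocks, clustered_blocks)  # 1000 10
```

Every matching row costs one block read in the scattered case; in the clustered case a single block read serves ROWS_PER_BLOCK matching rows.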
Ideas for improving performance
Here are some ideas to improve performance. Some of them are subsets of others, and some can be implemented together.
Save type in thing table
Since every object must have a type property, we can speed up queries by keeping the type of the latest revision of every object in the thing table. However, the type must still be stored in the datum table as well, since it can change between revisions.
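A minimal sketch of this idea using SQLite (the type id 52 and author id 4955588 come from the example query earlier; the rest of the data is made up): with type denormalized into thing, one self-join on datum disappears.

```python
import sqlite3

# Illustrative mini-schema: thing carries a denormalized type column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE thing (id INTEGER PRIMARY KEY, key TEXT, type INTEGER)")
conn.execute("""CREATE TABLE datum (
    thing_id INTEGER, end_revision INTEGER, key TEXT, value TEXT, datatype INTEGER)""")

EDITION_TYPE = 52  # id of /type/edition in the example above
conn.execute("INSERT INTO thing VALUES (1, '/b/OL1M', ?)", (EDITION_TYPE,))
conn.execute("INSERT INTO datum VALUES (1, 2147483647, 'authors', '4955588', 0)")

# The type test now hits thing directly instead of a second datum join.
rows = conn.execute("""
    SELECT thing.key FROM thing, datum AS d0
    WHERE thing.type = ?
      AND d0.thing_id = thing.id
      AND d0.end_revision = 2147483647
      AND d0.key = 'authors'
      AND CAST(d0.value AS INTEGER) = 4955588
""", (EDITION_TYPE,)).fetchall()
print(rows)  # [('/b/OL1M',)]
```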
Pre-compute important queries
In the Open Library, the two most frequently used queries are:
- getting all books written by a particular author (from the author view page)
- getting all authors whose name starts with a prefix (from author autocomplete on the edition edit page)
It is possible to pre-compute all this data and store it in memory to answer these queries quickly. These data structures must also be updated on every write operation.
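A sketch of such in-memory structures (the function names and keys below are illustrative, not existing Infobase APIs): a dict serves the books-by-author query, and a sorted list with binary search serves the name-prefix query; both are updated on write.

```python
from collections import defaultdict
import bisect

# Hypothetical in-memory indexes for the two hot queries.
books_by_author = defaultdict(set)  # author key -> set of edition keys
author_names = []                   # sorted (name, author key) pairs

def on_write(edition_key, author_key):
    # Called on every write that touches an edition's authors.
    books_by_author[author_key].add(edition_key)

def add_author(name, author_key):
    # Keep the name list sorted so prefix search stays O(log n).
    bisect.insort(author_names, (name, author_key))

def authors_with_prefix(prefix):
    i = bisect.bisect_left(author_names, (prefix,))
    out = []
    while i < len(author_names) and author_names[i][0].startswith(prefix):
        out.append(author_names[i][1])
        i += 1
    return out

add_author("Mark Twain", "/a/OL18319A")
add_author("Martin Gardner", "/a/OL19981A")
on_write("/b/OL1M", "/a/OL18319A")
print(authors_with_prefix("Mar"))  # ['/a/OL18319A', '/a/OL19981A']
```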
Separate storage for reading objects and executing queries
If we store the data of each datatype in a separate table, reading objects becomes slow because multiple tables have to be read. Since queries are allowed only on the latest revisions, old revisions are not needed to execute queries. So it might be a good idea to separate reading objects from executing queries. One way to do this is to store the JSON representation of each object directly in the version table.
    thing:        id, key, latest_revision, last_modified, created
    version:      thing_id, revision, created, comment, ip, machine_comment, data
    datum_int:    thing_id, key, value (int), ordering
    datum_ref:    thing_id, key, value (int, references thing), ordering
    datum_float:  thing_id, key, value (float), ordering
    datum_string: thing_id, key, value (text), ordering
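A minimal sketch of the read path under this design, assuming the version.data column holds the object's JSON representation (SQLite stands in for PostgreSQL, and the object contents are made up):

```python
import json
import sqlite3

# Reading an object becomes a single-row lookup plus a JSON decode,
# instead of assembling it from many datum rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE version (thing_id INTEGER, revision INTEGER, data TEXT)")

obj = {"key": "/b/OL1M", "type": "/type/edition", "title": "Example"}
conn.execute("INSERT INTO version VALUES (1, 3, ?)", (json.dumps(obj),))

def get_object(thing_id, revision):
    row = conn.execute(
        "SELECT data FROM version WHERE thing_id = ? AND revision = ?",
        (thing_id, revision)).fetchone()
    return json.loads(row[0]) if row else None

print(get_object(1, 3)["title"])  # Example
```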
Partial Vertical Partitioning
As discussed in [2], vertical partitioning is a method of storing the values of every property in a separate table. The authors claim this approach is very scalable; however, schema changes are painful with it, since changing the schema requires creating and deleting tables. In Infobase, schema changes are trivial to make. It would be nice to get the best of both approaches.
In every system there will be some important parts of the schema that will not change. For example, in the Open Library every book will always have authors and every author will always have a name. So if the system uses vertical partitioning for a few specified properties and the default behavior for the rest, we get the best of both approaches.
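A sketch of how writes could be routed under such partial partitioning (the property and table names are illustrative): a fixed set of "important" properties get their own tables, and everything else falls back to the generic datum table, so ordinary schema changes stay trivial.

```python
# Properties chosen for vertical partitioning (an assumed, fixed set).
PARTITIONED = {"authors", "name"}

def table_for(prop):
    # Partitioned properties live in their own table; the rest share
    # the generic datum table, preserving trivial schema changes.
    return f"datum_{prop}" if prop in PARTITIONED else "datum"

writes = [("authors", 4955588), ("title", "Example"), ("name", "Mark Twain")]
routed = [(table_for(k), k, v) for k, v in writes]
print(routed)
# [('datum_authors', 'authors', 4955588), ('datum', 'title', 'Example'),
#  ('datum_name', 'name', 'Mark Twain')]
```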
References
[1] Freebase Blog: A Brief Tour of Graphd
[2] Scalable Semantic Web Data Management Using Vertical Partitioning (pdf)
Updated Performance after adding a type column to the thing table
Test 1: {"type": "/type/edition", "authors": "/a/OL1458403A", "limit": "20"}
    SELECT thing.key
    FROM thing, datum AS d0
    WHERE thing.site_id = 1
      AND thing.type = 52
      AND d0.thing_id = thing.id
      AND d0.end_revision = 2147483647
      AND d0.key = 'authors'
      AND cast(d0.value AS int) = 4955588
      AND d0.datatype = 0
    LIMIT 20 OFFSET 0;
Disk blocks read:
     relname | heap_blks_read | heap_blks_hit | idx_blks_read | idx_blks_hit
    ---------+----------------+---------------+---------------+--------------
     thing   |              3 |             1 |             6 |            6
     datum   |              3 |             1 |             5 |            0
Test 2: {"type": "/type/edition", "authors": "/a/OL1458403A", "publisher": "Dover Publications", "limit": "20"}
Query:
    SELECT thing.key
    FROM thing, datum AS d0, datum AS d1
    WHERE thing.site_id = 1
      AND thing.type = 52
      AND d0.thing_id = thing.id
      AND d0.end_revision = 2147483647
      AND d0.key = 'authors'
      AND cast(d0.value AS int) = 4955588
      AND d0.datatype = 0
      AND d1.thing_id = thing.id
      AND d1.end_revision = 2147483647
      AND d1.key = 'publishers'
      AND d1.value = 'Dover Publications'
      AND d1.datatype = 2
    LIMIT 20 OFFSET 0;
Total runtime: 9586.715 ms
Disk stats:
     relname | heap_blks_read | heap_blks_hit | idx_blks_read | idx_blks_hit
    ---------+----------------+---------------+---------------+--------------
     thing   |              0 |             0 |             0 |            0
     datum   |           3848 |          5160 |           118 |        48360
Test 3: {"type": "/type/edition", "authors": "/a/OL1458403A", "sort": "title", "limit": "20"}
    SELECT thing.key
    FROM thing, datum AS d0, datum AS d1
    WHERE thing.site_id = 1
      AND thing.type = 52
      AND d0.thing_id = thing.id
      AND d0.end_revision = 2147483647
      AND d0.key = 'authors'
      AND cast(d0.value AS int) = 4955588
      AND d0.datatype = 0
      AND d1.thing_id = thing.id
      AND d1.end_revision = 2147483647
      AND d1.key = 'title'
      AND d1.datatype = 2
    ORDER BY d1.value
    LIMIT 20 OFFSET 0;
Takes too long (more than 15 minutes).
History
- Created February 4, 2009
- 2 revisions
- June 27, 2021 | Edited by Mek | Edited without comment.
- February 4, 2009 | Created by Anand Chitipothu | moving old docs from staging to production