
1 PostgreSQL 9.1.3 2-Day DBA QuickGuide Ver: 0.11
Author: Digoal.Zhou

2 Day 1: PostgreSQL 2-Day DBA QuickGuide

3 Day 1 About PostgreSQL; Tutorial (Installation, Initialize Cluster, Configure Cluster, Start/Stop Cluster, Connect to PostgreSQL); SQL Language (SQL Syntax, Data Definition, Data Manipulation, Privileges, Data Types, Functions and Operators, Type Casts); Indexes; Concurrency Control; SQL Performance Tuning

4 About PostgreSQL
Maximum size for a database? unlimited
Maximum size for a table? 32 TB
Maximum size for a row? 1.6 TB
Maximum size for a field? 1 GB
Maximum number of rows in a table? unlimited
Maximum number of columns in a table? 250-1600, depending on column types
Maximum number of indexes on a table? unlimited

5 Tutorial Installation, Initialize Cluster, Configure Cluster, Start/Stop Cluster, Connect to PostgreSQL

6 Tutorial Version and pre-installation preparation: PostgreSQL 9.1.3 on CentOS 5.x x64.
Download the source code, configure the storage, configure the OS.

7 Configure Storage Disk selection: mechanical disks vs SSD. RAID selection: RAID5, RAID5/0, RAID1, RAID1/0
RAID5 and RAID5/0: poor write performance, severe degradation after a disk failure, long rebuild time; n-1 of the capacity is usable. RAID1 and RAID1/0: good read/write performance, almost no degradation after a disk failure, short rebuild time; n/2 of the capacity is usable. On high-end storage with a large enough cache, RAID5/RAID5/0 write performance can also be acceptable. Storage cache: with power-loss protection in place, enable the write cache of the storage array or RAID card; the disks' own write cache must be disabled. The pg_test_fsync module calls the various synchronous-write functions to measure how the storage handles I/O; the tests cover both write-back and write-through.

8 Configure Storage Example of testing synchronous writes with fdatasync: open once, then loop over write + fdatasync + lseek, writing to the same position each time.
[Reference] Linux Programmer's Manual (open, write, lseek, fdatasync)

9 Configure Storage Example of writing without any synchronous-write call: open, write, close on every iteration.
The aim of WAL is to ensure that the log is written before database records are altered, but this can be subverted by disk drives that falsely report a successful write to the kernel, when in fact they have only cached the data and not yet stored it on the disk. A power failure in such a situation might lead to irrecoverable data corruption. Administrators should try to ensure that disks holding PostgreSQL's WAL log files do not make such false reports. (See Section 29.1.)
After a checkpoint has been made and the log flushed, the checkpoint's position is saved in the file pg_control. Therefore, at the start of recovery, the server first reads pg_control and then the checkpoint record; then it performs the REDO operation by scanning forward from the log position indicated in the checkpoint record. Because the entire content of data pages is saved in the log on the first page modification after a checkpoint (assuming full_page_writes is not disabled), all pages changed since the checkpoint will be restored to a consistent state.
To deal with the case where pg_control is corrupt, we should support the possibility of scanning existing log segments in reverse order — newest to oldest — in order to find the latest checkpoint. This has not been implemented yet. pg_control is small enough (less than one disk page) that it is not subject to partial-write problems, and as of this writing there have been no reports of database failures due solely to the inability to read pg_control itself. So while it is theoretically a weak spot, pg_control does not seem to be a problem in practice.

10 Configure OS Choose which OS dependency packages to install based on the options you will pass to PostgreSQL's configure. OS configuration:
/etc/sysctl.conf, /etc/security/limits.conf, /etc/sysconfig/iptables
Time synchronization: either run the ntpd service, or use a crontab entry such as: 8 * * * * /usr/sbin/ntpdate asia.pool.ntp.org && /sbin/hwclock --systohc
Device management: logical volumes; blockdev --setra
Filesystems: XFS, ext3, ext4, ZFS; mount with noatime
Add the postgres user and configure its environment.
[Reference] CentOS kernel-doc-x.x.xx-xxx. OS packages:
%packages @admin-tools @base @chinese-support @core @development-libs @development-tools @dialup @editors @legacy-software-development @legacy-software-support @system-tools @text-internet @base-x kexec-tools fipscheck device-mapper-multipath sgpio perl-XML-SAX libstdc++44-devel perl-XML-NamespaceSupport perl-Crypt-SSLeay python-imaging python-dmidecode gcc44-c++ pexpect imake gcc44 emacs gtk+ compat-db openssl097a compat-readline43 compat-openldap compat-slang openmotif22 tog-pegasus audit sysstat xorg-x11-utils xorg-x11-server-Xnest xorg-x11-server-Xvfb net-snmp perl-XML-Parser perl-XML-Simple perl-Net-IP perl-Net-SSLeay e4fsprogs -yum-updatesd -bluez-utils -wvdial -minicom -isdn4k-utils -bluez-hcidump -bluez-gnome

11 Configure OS vi /etc/sysctl.conf
kernel.shmmax =
kernel.shmall =
kernel.shmmni = 4096
kernel.sem =
fs.file-max =
net.ipv4.ip_local_port_range =
net.core.rmem_default =
net.core.rmem_max =
net.core.wmem_default =
net.core.wmem_max =
fs.aio-max-nr =
sysctl -p
#!/bin/bash
# simple shmsetup script
page_size=`getconf PAGE_SIZE`
phys_pages=`getconf _PHYS_PAGES`
shmall=`expr $phys_pages`
shmmax=`expr $shmall \* $page_size`
echo kernel.shmmax = $shmmax
echo kernel.shmall = $shmall

12 Configure OS vi /etc/security/limits.conf
* soft nofile
* hard nofile
* soft nproc
* hard nproc
* soft core unlimited
* hard core unlimited
* soft memlock
* hard memlock
vi /etc/sysconfig/iptables
-A RH-Firewall-1-INPUT -i lo -j ACCEPT
# allow a source network
-A RH-Firewall-1-INPUT -s /16 -j ACCEPT
# allow a source network to reach the target port
# -A RH-Firewall-1-INPUT -s /24 -m state --state NEW -m tcp -p tcp --dport -j ACCEPT
# allow any IP to reach the target port
# -A RH-Firewall-1-INPUT -p tcp -m state --state NEW -m tcp --dport -j ACCEPT

13 Configure OS useradd postgres; vi ~/.bash_profile
export PGPORT=5432
export PGDATA=/data01/pgdata/digoal/5432/pg_root
export PGHOME=/opt/pgsql
export PGHOST=$PGDATA
export LANG=en_US.utf8
export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib
export DATE=`date +"%Y%m%d%H%M"`
export PATH=$PGHOME/bin:$PATH:.
export MANPATH=$PGHOME/share/man:$MANPATH
alias rm='rm -i'
alias ll='ls -lh'
[Reference]

14 Installation su - root
tar -jxvf postgresql-9.1.3.tar.bz2
chown -R postgres:postgres postgresql-9.1.3
su - postgres
cd postgresql-9.1.3
./configure --prefix=/opt/pgsql --with-pgport= --with-perl --with-python --with-openssl --with-pam --without-ldap --with-libxml --with-libxslt --enable-thread-safety
gmake world
gmake install-world
[Reference] postgresql-9.1.3/INSTALL:
If you want to build everything that can be built, including the documentation (HTML and man pages), and the additional modules ("contrib"), type instead: gmake world
gmake install
This will install files into the directories that were specified in step 1. Make sure that you have appropriate permissions to write into that area. Normally you need to do this step as root. Alternatively, you can create the target directories in advance and arrange for appropriate permissions to be granted.
To install the documentation (HTML and man pages), enter: gmake install-docs
If you built the world above, type instead: gmake install-world
This also installs the documentation.
You can use gmake install-strip instead of gmake install to strip the executable files and libraries as they are installed. This will save some space. If you built with debugging support, stripping will effectively remove the debugging support, so it should only be done if debugging is no longer needed. install-strip tries to do a reasonable job saving space, but it does not have perfect knowledge of how to strip every unneeded byte from an executable file, so if you want to save all the disk space you possibly can, you will have to do manual work.
The standard installation provides all the header files needed for client application development as well as for server-side program development, such as custom functions or data types written in C. (Prior to PostgreSQL 8.0, a separate gmake install-all-headers command was needed for the latter, but this step has been folded into the standard install.)
Client-only installation: If you want to install only the client applications and interface libraries, then you can use these commands:
gmake -C src/bin install
gmake -C src/include install
gmake -C src/interfaces install
gmake -C doc install
"src/bin" has a few binaries for server-only use, but they are small.

15 Initialize Cluster initdb -A md5 -D $PGDATA -E UTF8 --locale=C -W
-A, --auth=METHOD default authentication method for local connections
-D, --pgdata=DATADIR location for this database cluster
-E, --encoding=ENCODING set default encoding for new databases
--locale=LOCALE set default locale for new databases
-W, --pwprompt prompt for a password for the new superuser

16 Configure Cluster pg_hba.conf # TYPE DATABASE USER ADDRESS METHOD
# "local" is for Unix domain socket connections only
# local all all md5
local all all trust
# IPv4 local connections:
# host all all / md5
host all all / trust
# IPv6 local connections:
host all all ::1/ md5
host test all /0 md5
host postgres all /0 reject
host all all /0 md5
Entries are matched from top to bottom; once a line matches, the lines below it are not consulted. See pg_hba.conf for details.

17 Configure Cluster (partial listing) postgresql.conf
max_wal_senders = 32
listen_addresses = ' '
hot_standby = on
unix_socket_directory = '.'
random_page_cost = 2.0
unix_socket_permissions = 0700
effective_cache_size = 12000MB
shared_buffers = 512MB
log_checkpoints = on
maintenance_work_mem = 512MB
log_statement = 'ddl'
max_stack_depth = 8MB
track_activity_query_size = 2048
shared_preload_libraries = 'pg_stat_statements'
autovacuum = on
log_autovacuum_min_duration = 0
wal_level = hot_standby
custom_variable_classes = 'pg_stat_statements'
wal_buffers = 16384kB
synchronous_commit = off
pg_stat_statements.max = 1000
wal_writer_delay = 10ms
pg_stat_statements.track = all
checkpoint_segments = 128
archive_mode = on
archive_command = '/bin/date'
Set the parameters that require a restart up front, to avoid having to restart the database for later changes.

18 Configure Cluster How dynamically adjustable settings (including pg_hba.conf) take effect after a change:
pg_ctl reload -D $PGDATA, or send a SIGHUP signal to the postgres master process. How statically configured settings take effect: pg_ctl stop -m fast -D $PGDATA, then pg_ctl start -D $PGDATA
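The same reload can also be triggered from SQL by a superuser; a minimal sketch using the built-in pg_reload_conf() function:
postgres=# SELECT pg_reload_conf();  -- sends SIGHUP to the postmaster, same effect as pg_ctl reload
 pg_reload_conf
----------------
 t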

19 Start/Stop Cluster Start: su - postgres; pg_ctl start -D $PGDATA. Stop:
pg_ctl stop -m fast -D $PGDATA

20 Start/Stop Cluster Shutdown modes: smart (default): wait for all connected clients to disconnect, and wait for any online backup to finish. fast:
do not wait for clients to disconnect; all in-progress transactions are rolled back and the connections are terminated; an online backup, if any, is aborted. immediate: abort all server processes. This is the fastest way to stop the cluster, but recovery is required at the next startup; it is typically used for emergencies, e.g. the UPS is nearly drained and the database and host must be stopped at once. In practice, shut down with smart mode first, after which the database accepts no new connections; once the existing transactions have completed, shut down with fast mode. This minimizes the chance of rolling back transactions.

21 Connect to PostgreSQL Connect to a database: psql -h $ip (TCP/IP) or psql -h unix_socket_directory (local Unix socket; the profile above sets PGHOST=$PGDATA and unix_socket_directory='.')

22 SQL Language Say SQL

23 SQL Syntax

24 SQL Syntax-Lexical Structure
SELECT * FROM pg_class WHERE relname = 'pg_statistic' LIMIT 1; -- is one comment
TOKEN types:
keyword (SELECT FROM WHERE LIMIT)
identifier or quoted identifier (pg_class, relname: table names, column names, object names...); unquoted identifiers fold to lower case, so use double quotes when upper case must be preserved
literal or constant ('pg_statistic', 1)
special character symbol (*)
comment (-- one-line comment, or /* */ multi-line comment)
operator (=)

25 keyword

26 identifier #define NAMEDATALEN 64 (the length limit; identifiers are truncated to 63 bytes)
An identifier longer than the limit is truncated to NAMEDATALEN-1 bytes, and a NOTICE is raised. src/backend/parser/scansup.c:
/*
 * truncate_identifier() --- truncate an identifier to NAMEDATALEN-1 bytes.
 *
 * The given string is modified in-place, if necessary. A warning is
 * issued if requested.
 * We require the caller to pass in the string length since this saves a
 * strlen() call in some common usages.
 */
void
truncate_identifier(char *ident, int len, bool warn)
{
    if (len >= NAMEDATALEN)
    {
        len = pg_mbcliplen(ident, len, NAMEDATALEN - 1);
        if (warn)
        {
            char buf[NAMEDATALEN];

            /*
             * We avoid using %.*s here because it can misbehave if the data
             * is not valid in what libc thinks is the prevailing encoding.
             */
            memcpy(buf, ident, len);
            buf[len] = '\0';
            ereport(NOTICE,
                    (errcode(ERRCODE_NAME_TOO_LONG),
                     errmsg("identifier \"%s\" will be truncated to \"%s\"",
                            ident, buf)));
        }
        ident[len] = '\0';
    }
}

27 identifier Changing the maximum length limit requires a new initdb. src/include/pg_config_manual.h:
/*
 * Maximum length for identifiers (e.g. table names, column names,
 * function names). Names actually are limited to one less byte than this,
 * because the length must include a trailing zero byte.
 *
 * Changing this requires an initdb.
 */
#define NAMEDATALEN 64
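A quick psql sketch of the truncation NOTICE described above; the 70-character name is made up for illustration, and any name longer than 63 bytes behaves the same way:
postgres=# CREATE TABLE aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa (id int);  -- 70 'a's
NOTICE:  identifier "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" will be truncated to "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
CREATE TABLE
-- the table is created under the 63-byte truncated name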

28 literal or constant
implicitly-typed literal or constant:
string: 'string', E'digoal\\', $$digoal\$$, $tag$digoal\$tag$
bit string: B' '
number: 10 or +10, -23.4 or 100.1, 10e-1, 98e+10 or 98e10
explicitly-typed literal or constant:
type 'string': time '12:00:00'
'string'::type: '1 hour'::interval
CAST ( 'string' AS type ): CAST(' ' AS inet);

29 operator + - * / < > = ~ ! @ # % ^ & | ` ? ||
postgres=# select count(*) from pg_operator;
 count
-------
   706
SELECT 3 OPERATOR(pg_catalog.+) 4;

30 special character
$  dollar-quoted string constants; positional parameters in a function or prepared statement
:  slice selection from an array
*  all the fields of a table or composite value
() enforce precedence
.  numeric constants; separates schema, table, and column names
[] array element selection
,  separates the elements of a list
;  terminates a SQL statement

31 SQL Syntax-Value Expressions
constant: covered earlier
column reference: [schema_name.]table.column_name or alias.column_name
positional parameter: $number, used e.g. in a function or a prepared statement. For example:
postgres=# CREATE TABLE digoal_t1(id int,info text);
postgres=# INSERT INTO digoal_t1 VALUES (1,'DIGOAL'),(2,'PostgreSQL'),(3,'Pgfoundry'),(4,'pgxn');
postgres=# PREPARE pre_1(text) AS SELECT id FROM digoal_t1 WHERE lower(info) ~ lower($1);
postgres=# EXECUTE pre_1('post');
 id
----
  2
postgres=# DEALLOCATE pre_1;

32 SQL Syntax-Value Expressions
subscript. For example:
SELECT a[2:3][1:2] FROM (SELECT ARRAY[[1,2,3],[4,5,6],[7,8,9],[10,11,12]] AS a) t;
 {{4,5},{7,8}}
field selection: pick a field by name out of a row-type or composite-type value. For example:
CREATE TYPE inventory_item AS (
  name text,
  supplier_id integer,
  price numeric
);
CREATE TABLE on_hand (
  item inventory_item,
  count integer
);
INSERT INTO on_hand VALUES (ROW('fuzzy dice', 42, 1.99), 1000);
postgres=# SELECT (item).name FROM on_hand WHERE (item).price > .99;
 fuzzy dice

33 SQL Syntax-Value Expressions
operator invocation: OPERATOR(schema.operatorname). For example:
postgres=# SELECT 3 OPERATOR(pg_catalog.+) 4;
 ?column?
----------
        7
function call:
postgres=# SELECT now();
 :52:
aggregate expression:
aggregate_name (expression [ , ... ] [ order_by_clause ] )
aggregate_name (ALL expression [ , ... ] [ order_by_clause ] )
aggregate_name (DISTINCT expression [ , ... ] [ order_by_clause ] )
aggregate_name ( * )
For example, an aggregate that gathers multiple rows into one array:
postgres=# SELECT array_agg(id ORDER BY id) FROM (SELECT generate_series(1,10) AS id) AS t;
 {1,2,3,4,5,6,7,8,9,10}
postgres=# SELECT array_agg(id ORDER BY id desc) FROM (SELECT generate_series(1,10) AS id) AS t;
 {10,9,8,7,6,5,4,3,2,1}

34 SQL Syntax-Value Expressions
window function call. For example:
CREATE TABLE window_test(id int, name text, subject text, score numeric);
INSERT INTO window_test VALUES
(1,'digoal','数学',99.5),
(2,'digoal','语文',89.5),
(3,'digoal','英语',79.5),
(4,'digoal','物理',99.5),
(5,'digoal','化学',98.5),
(6,'刘德华','数学',89.5),
(7,'刘德华','语文',99.5),
(8,'刘德华','英语',79.5),
(9,'刘德华','物理',89.5),
(10,'刘德华','化学',69.5),
(11,'张学友','数学',89.5),
(12,'张学友','语文',91.5),
(13,'张学友','英语',92.5),
(14,'张学友','物理',93.5),
(15,'张学友','化学',94.5);
-- Fetch the top scorer of each subject.
SELECT id,name,subject,score FROM
(SELECT row_number() OVER (PARTITION BY subject ORDER BY score DESC) AS rn,* FROM window_test) AS t
WHERE rn=1 ORDER BY SUBJECT;

35 SQL Syntax-Value Expressions
Result of the window query on the previous slide:
 id | name   | subject | score
  5 | digoal | 化学    | 98.5
  1 | digoal | 数学    | 99.5
  4 | digoal | 物理    | 99.5
 13 | 张学友 | 英语    | 92.5
  7 | 刘德华 | 语文    | 99.5
type cast: examples shown earlier
array constructor: ARRAY[]
row constructor: ROW()
collation expression:
SELECT a, b, c FROM tbl WHERE ... ORDER BY a COLLATE "C";
SELECT * FROM tbl WHERE a > 'foo' COLLATE "C";
SELECT * FROM tbl WHERE a COLLATE "C" > 'foo';
scalar subquery:
SELECT name, (SELECT max(pop) FROM cities WHERE cities.state = states.name)
FROM states;

36 SQL Syntax-Function Call
Create:
CREATE OR REPLACE FUNCTION f_test(i_left numeric, i_right numeric) RETURNS numeric AS $$
DECLARE
BEGIN
RETURN i_left * i_right;
END;
$$ LANGUAGE plpgsql;
Ways to call the function:
Positional Notation: SELECT f_test(10, 2.5);
Named Notation: SELECT f_test(i_left := 10, i_right := 2.5);
Mixed Notation: SELECT f_test(10, i_right := 2.5);

37 Data Definition

38 Table and Default Value
CREATE TABLE test (id serial PRIMARY KEY,name text,info text,crt_time timestamp(0) default now());
INSERT INTO test (name,info) VALUES ('digoal','DBA');
SELECT * FROM test;
 id | name   | info | crt_time
  1 | digoal | DBA  | :52:19
Table "public.test"
 Column   | Type                           | Modifiers
 id       | integer                        | not null default nextval('test_id_seq'::regclass)
 name     | text                           |
 info     | text                           |
 crt_time | timestamp(0) without time zone | default now()
Indexes:
    "test_pkey" PRIMARY KEY, btree (id)

39 Constraint
check:
CREATE TABLE products (
  product_no integer,
  name text,
  price numeric,
  CHECK (price > 0),
  discounted_price numeric,
  CHECK (discounted_price > 0),
  CHECK (price > discounted_price)
);
not null / unique / primary key / foreign key:
CREATE TABLE a(c1 text NOT NULL,c2 text NOT NULL,UNIQUE (c1,c2));
CREATE TABLE a(c1 text,c2 text,UNIQUE (c1,c2));
CREATE TABLE a(c1 text,c2 text,PRIMARY KEY (c1,c2));
foreign key (columns) REFERENCES table(columns): the referenced columns must carry a unique constraint or primary key that matches the foreign key exactly, composite to composite, single to single; the two forms cannot be mixed:
CREATE TABLE b(c1 text,c2 text,FOREIGN KEY(c1,c2) REFERENCES a(c1,c2));
-- referencing two single-column unique constraints with a composite foreign key fails:
CREATE TABLE a(c1 text UNIQUE,c2 text UNIQUE);
CREATE TABLE b(c1 text,c2 text,FOREIGN KEY(c1,c2) REFERENCES a(c1,c2));
ERROR: there is no unique constraint matching given keys for referenced table "a"
REFERENCES reftable [ ( refcolumn ) ] [ MATCH matchtype ] [ ON DELETE action ] [ ON UPDATE action ] (column constraint)
FOREIGN KEY ( column [, ... ] ) REFERENCES reftable [ ( refcolumn [, ... ] ) ] [ MATCH matchtype ] [ ON DELETE action ] [ ON UPDATE action ] (table constraint)
These clauses specify a foreign key constraint, which requires that a group of one or more columns of the new table must only contain values that match values in the referenced column(s) of some row of the referenced table. If refcolumn is omitted, the primary key of the reftable is used. The referenced columns must be the columns of a non-deferrable unique or primary key constraint in the referenced table. Note that foreign key constraints cannot be defined between temporary tables and permanent tables.
A value inserted into the referencing column(s) is matched against the values of the referenced table and referenced columns using the given match type. There are three match types: MATCH FULL, MATCH PARTIAL, and MATCH SIMPLE, which is also the default. MATCH FULL will not allow one column of a multicolumn foreign key to be null unless all foreign key columns are null. MATCH SIMPLE allows some foreign key columns to be null while other parts of the foreign key are not null. MATCH PARTIAL is not yet implemented.
In addition, when the data in the referenced columns is changed, certain actions are performed on the data in this table's columns. The ON DELETE clause specifies the action to perform when a referenced row in the referenced table is being deleted. Likewise, the ON UPDATE clause specifies the action to perform when a referenced column in the referenced table is being updated to a new value. If the row is updated, but the referenced column is not actually changed, no action is done. Referential actions other than the NO ACTION check cannot be deferred, even if the constraint is declared deferrable. There are the following possible actions for each clause:
NO ACTION: Produce an error indicating that the deletion or update would create a foreign key constraint violation. If the constraint is deferred, this error will be produced at constraint check time if there still exist any referencing rows. This is the default action.
RESTRICT: Produce an error indicating that the deletion or update would create a foreign key constraint violation. This is the same as NO ACTION except that the check is not deferrable.
CASCADE: Delete any rows referencing the deleted row, or update the value of the referencing column to the new value of the referenced column, respectively.
SET NULL: Set the referencing column(s) to null.
SET DEFAULT: Set the referencing column(s) to their default values.
If the referenced column(s) are changed frequently, it might be wise to add an index to the foreign key column so that referential actions associated with the foreign key column can be performed more efficiently.
DEFERRABLE / NOT DEFERRABLE: This controls whether the constraint can be deferred. A constraint that is not deferrable will be checked immediately after every command. Checking of constraints that are deferrable can be postponed until the end of the transaction (using the SET CONSTRAINTS command). NOT DEFERRABLE is the default. Currently, only UNIQUE, PRIMARY KEY, EXCLUDE, and REFERENCES (foreign key) constraints accept this clause. NOT NULL and CHECK constraints are not deferrable.
INITIALLY IMMEDIATE / INITIALLY DEFERRED: If a constraint is deferrable, this clause specifies the default time to check the constraint. If the constraint is INITIALLY IMMEDIATE, it is checked after each statement. This is the default. If the constraint is INITIALLY DEFERRED, it is checked only at the end of the transaction. The constraint check time can be altered with the SET CONSTRAINTS command.

40 Constraint foreign key
FOREIGN KEY ( column_name [, ... ] ) REFERENCES reftable [ ( refcolumn [, ... ] ) ] [ MATCH FULL | MATCH PARTIAL | MATCH SIMPLE ] [ ON DELETE action ] [ ON UPDATE action ] [ DEFERRABLE | NOT DEFERRABLE ] [ INITIALLY DEFERRED | INITIALLY IMMEDIATE ]
MATCH FULL will not allow one column of a multicolumn foreign key to be null unless all foreign key columns are null. MATCH SIMPLE allows some foreign key columns to be null while other parts of the foreign key are not null.
Actions: NO ACTION (default, can be deferred), RESTRICT (like NO ACTION, but cannot be deferred), CASCADE, SET NULL, SET DEFAULT.
Restricting and cascading deletes are the two most common options. RESTRICT prevents deletion of a referenced row. NO ACTION means that if any referencing rows still exist when the constraint is checked, an error is raised; this is the default behavior if you do not specify anything. (The essential difference between these two choices is that NO ACTION allows the check to be deferred until later in the transaction, whereas RESTRICT does not.) CASCADE specifies that when a referenced row is deleted, row(s) referencing it should be automatically deleted as well. There are two other options: SET NULL and SET DEFAULT. These cause the referencing columns to be set to nulls or default values, respectively, when the referenced row is deleted. Note that these do not excuse you from observing any constraints. For example, if an action specifies SET DEFAULT but the default value would not satisfy the foreign key, the operation will fail. Analogous to ON DELETE there is also ON UPDATE which is invoked when a referenced column is changed (updated). The possible actions are the same.
Since a DELETE of a row from the referenced table or an UPDATE of a referenced column will require a scan of the referencing table for rows matching the old value, it is often a good idea to index the referencing columns. Because this is not always needed, and there are many choices available on how to index, declaration of a foreign key constraint does not automatically create an index on the referencing columns.
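A minimal sketch of the CASCADE action, using hypothetical master/detail tables (not from the original deck):
CREATE TABLE master (id int PRIMARY KEY);
CREATE TABLE detail (id int REFERENCES master(id) ON DELETE CASCADE);
INSERT INTO master VALUES (1);
INSERT INTO detail VALUES (1);
DELETE FROM master WHERE id=1;  -- the referencing row in detail is deleted as well
SELECT count(*) FROM detail;    -- 0; under the default NO ACTION the DELETE would have raised an error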

41 Constraint More advanced constraints, e.g. exclusion constraints
CREATE TABLE test(id int,geo circle,EXCLUDE USING GIST (geo WITH pg_catalog.&&));
INSERT INTO test values(1,'<(0,0),2>'::circle);
INSERT INTO test values(1,'<(4.1,0),2>'::circle);
INSERT INTO test values(1,'<(-1.9,0),2>'::circle);
ERROR: conflicting key value violates exclusion constraint "test_geo_excl"
DETAIL: Key (geo)=(<(-1.9,0),2>) conflicts with existing key (geo)=(<(0,0),2>).
EXCLUDE [ USING index_method ] ( exclude_element WITH operator [, ... ] ) index_parameters [ WHERE ( predicate ) ]
The EXCLUDE clause defines an exclusion constraint, which guarantees that if any two rows are compared on the specified column(s) or expression(s) using the specified operator(s), not all of these comparisons will return TRUE. If all of the specified operators test for equality, this is equivalent to a UNIQUE constraint, although an ordinary unique constraint will be faster. However, exclusion constraints can specify constraints that are more general than simple equality. For example, you can specify a constraint that no two rows in the table contain overlapping circles (see Section 8.8) by using the && operator.
Exclusion constraints are implemented using an index, so each specified operator must be associated with an appropriate operator class (see Section 11.9) for the index access method index_method. The operators are required to be commutative. Each exclude_element can optionally specify an operator class and/or ordering options; these are described fully under CREATE INDEX.
The access method must support amgettuple (see Chapter 52); at present this means GIN cannot be used. Although it's allowed, there is little point in using B-tree or hash indexes with an exclusion constraint, because this does nothing that an ordinary unique constraint doesn't do better. So in practice the access method will always be GiST.
The predicate allows you to specify an exclusion constraint on a subset of the table; internally this creates a partial index. Note that parentheses are required around the predicate.
(Diagram: circles c1, c2, c3 illustrating the overlap test.)

42 System Column oid (4 bytes)

43 System Column oid is mainly used in the system catalogs
Object identifiers (OIDs) are used internally by PostgreSQL as primary keys for various system tables. OIDs are not guaranteed to be unique across different system tables. obj_description(123456,'pg_class') would retrieve the comment for the table with OID 123456; because OIDs are only unique per catalog, the single-parameter form obj_description(oid) is deprecated.
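For example, assuming the test table from an earlier slide still exists, a comment can be stored and read back with the two-parameter form:
COMMENT ON TABLE test IS 'demo table';
SELECT obj_description('test'::regclass, 'pg_class');  -- returns 'demo table'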

44 System Column tableoid (4 Bytes)
postgres=# CREATE TABLE test (id int);
postgres=# CREATE SEQUENCE seq_test START WITH 1;
postgres=# INSERT INTO test VALUES(1);
postgres=# SELECT tableoid FROM test;
 26534
postgres=# SELECT tableoid FROM seq_test LIMIT 1;
 26537
postgres=# SELECT relname FROM pg_class WHERE oid IN (26534, 26537);
 relname
----------
 test
 seq_test

45 System Column
ctid (6 Bytes)
xmin / xmax / cmin / cmax (4 Bytes)
(Diagram: a heap table's PAGE(0), with item pointers it1..it6 at the start of the page and tuples tup6..tup1 filled from the end.)
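A small sketch (with a hypothetical table mvcc_demo) of how these hidden columns change when a row is updated:
CREATE TABLE mvcc_demo (id int);
INSERT INTO mvcc_demo VALUES (1);
SELECT ctid, xmin, xmax, cmin, cmax FROM mvcc_demo;  -- ctid (0,1): item 1 on page 0; xmin = inserting xid
UPDATE mvcc_demo SET id=2;
SELECT ctid, xmin FROM mvcc_demo;  -- ctid is now (0,2): the UPDATE wrote a new row version onto the page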

46 System Column (Diagram: old row versions chained to the new row version.)

47 Modify Table add column: if the new column has a default value, the whole table is rewritten, indexes included, under an exclusive lock, so operate with care (see the sketch below).
drop column: merely marks the column as dropped in pg_attribute. See: Recover droped column from PostgreSQL; Can session_replication_role used like MySQL's BlackHole Engine?; ALTER TABLE syntax
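A sketch of the difference, reusing the test table from earlier (the column names here are made up):
ALTER TABLE test ADD COLUMN flag int;              -- no default: catalog-only change, fast
ALTER TABLE test ADD COLUMN status int DEFAULT 0;  -- with a default: rewrites the table and its indexes in 9.1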

48 Modify Table ALTER TABLE syntax

49 Privilege Creating roles: a role must be granted privileges before it can do work. Options when creating a role: IN ROLE | GROUP makes the new role a member of existing roles;
ROLE | USER makes existing roles members of the new role; ADMIN does the same, adding WITH ADMIN OPTION. (A minimal sketch follows.)
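A minimal sketch with hypothetical role names:
CREATE ROLE dba_group NOLOGIN;               -- an existing group role
CREATE ROLE digoal LOGIN IN ROLE dba_group;  -- IN ROLE: the new role joins dba_group
CREATE ROLE app_group ROLE digoal;           -- ROLE: existing role digoal joins the new app_group
CREATE ROLE ops_group ADMIN digoal;          -- ADMIN: like ROLE, plus WITH ADMIN OPTION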

50 Privilege (Diagram of the privilege hierarchy: Cluster, Database(s), Schema(s), Function(s), Language(s),
Table(s), Index(es), Sequence(s), Other Obj(s). To do work, a ROLE needs privileges at the matching level.)

51 Privilege Database: the level below a database is the schema, so CREATE on a database is the right to create schemas in it; TEMP allows creating temporary tables in the database. GRANT { { CREATE | CONNECT | TEMPORARY | TEMP } [, ...] | ALL [ PRIVILEGES ] } ON DATABASE database_name [, ...] TO { [ GROUP ] role_name | PUBLIC } [, ...] [ WITH GRANT OPTION ] Schema: GRANT { { CREATE | USAGE } [, ...] | ALL [ PRIVILEGES ] } ON SCHEMA schema_name [, ...] Tablespace: GRANT { CREATE | ALL [ PRIVILEGES ] } ON TABLESPACE tablespace_name [, ...]

52 Privilege Table GRANT { { SELECT | INSERT | UPDATE | DELETE | TRUNCATE | REFERENCES | TRIGGER } [, ...] | ALL [ PRIVILEGES ] } ON { [ TABLE ] table_name [, ...] | ALL TABLES IN SCHEMA schema_name [, ...] } TO { [ GROUP ] role_name | PUBLIC } [, ...] [ WITH GRANT OPTION ] Column: one of PostgreSQL's stronger points; privileges can be granted on individual columns. GRANT { { SELECT | INSERT | UPDATE | REFERENCES } ( column [, ...] ) [, ...] | ALL [ PRIVILEGES ] ( column [, ...] ) } ON [ TABLE ] table_name [, ...] Table-level and column-level privileges are easy to confuse; see the sketch below.
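A sketch with a hypothetical role and the test table from earlier:
GRANT SELECT ON test TO digoal;         -- table level: read every column
GRANT UPDATE (name) ON test TO digoal;  -- column level: may update only the name column
-- an UPDATE by digoal touching any other column, e.g. SET info='x', is denied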

53 Privilege Language GRANT { USAGE | ALL [ PRIVILEGES ] }
ON LANGUAGE lang_name [, ...] TO { [ GROUP ] role_name | PUBLIC } [, ...] [ WITH GRANT OPTION ] Function GRANT { EXECUTE | ALL [ PRIVILEGES ] } ON { FUNCTION function_name ( [ [ argmode ] [ arg_name ] arg_type [, ...] ] ) [, ...] | ALL FUNCTIONS IN SCHEMA schema_name [, ...] }

54 Privilege Sequence GRANT { { USAGE | SELECT | UPDATE }
[, ...] | ALL [ PRIVILEGES ] } ON { SEQUENCE sequence_name [, ...] | ALL SEQUENCES IN SCHEMA schema_name [, ...] } TO { [ GROUP ] role_name | PUBLIC } [, ...] [ WITH GRANT OPTION ] Role: GRANT role_name [, ...] TO role_name [, ...] [ WITH ADMIN OPTION ] (note the INHERIT | NOINHERIT distinction; see PostgreSQL Role Membership)

55 Schema Schema Search Path
SHOW search_path;
 search_path
----------------
 "$user",public
Or use qualified names: schema.table, or database.schema.table (with the latter form the database must be the one you are currently connected to). The search path works in the same way for data type names, function names, and operator names as it does for table names. Data type and function names can be qualified in exactly the same way as table names. If you need to write a qualified operator name in an expression, there is a special provision: you must write OPERATOR(schema.operator), for example: SELECT 3 OPERATOR(pg_catalog.+) 4;

56 Schema public Schema REVOKE CREATE ON SCHEMA public FROM PUBLIC;
System Catalog Schema: pg_catalog is always effectively part of the search path. If it is not named explicitly in the path then it is implicitly searched before searching the path's schemas. However, you can explicitly place pg_catalog at the end of your search path if you prefer to have user-defined names override built-in names. SET search_path="$user",public,pg_catalog; Create a schema with the same name as each user, to ease porting. Using the public schema is not recommended.

57 Inheritance A table can inherit from multiple tables and can itself be inherited by multiple tables; multi-level inheritance is allowed; inheritance cycles are not.
Constraints (FOREIGN KEY, UNIQUE KEY, PRIMARY KEY, CHECK, NOT NULL) each constrain only the single table they are defined on. Foreign key optimization: index the referencing columns.

58 Inheritance "全局约束的函数" 查询 主表以及所有子表全局约束举例1(不可行的做法, 并发时将无法确保唯一, 后面有例子)
ALTER TABLE p ADD CONSTRAINT ck_pk_p CHECK(check_pk_p(id) IS true); "全局约束的函数"

59 Inheritance Global constraint across parent and children, example 1 (not viable) SESSION A: postgres=# BEGIN;
postgres=# INSERT INTO p VALUES (1,'DIGOAL','test');
SESSION B:
postgres=# INSERT INTO c1 VALUES (1,'DIGOAL','test');
postgres=# END;
postgres=# SELECT tableoid,* FROM p;
 tableoid | id | name   | info
    26588 |  1 | DIGOAL | test
    26614 |  1 | DIGOAL | test

60 Inheritance Global constraint across parent and children, example 2 (viable, but only for the partition key; if another column must be globally unique, consider an inverse mapping table keyed on that column, or use triggers). Partitioning by segments:
create table p (id int primary key,name text unique,info text);
create table c0(like p including all) inherits(p);
create table c1(like p including all) inherits(p);
alter table p add constraint ck_p_1 check(false);
alter table c0 add constraint ck_c0_1 check(mod(id,2)=0);
alter table c0 add constraint ck_c0_2 check(mod(hashtext(name),2)=0);
alter table c1 add constraint ck_c1_1 check(mod(id,2)=1);
alter table c1 add constraint ck_c1_2 check(mod(hashtext(name),2)=1);
Note: DELETE, UPDATE and SELECT on a parent table affect all child tables as well as the target table unless ONLY is given; INSERT has no ONLY option and only ever affects the target table, unless a RULE or TRIGGER intervenes. A short sketch follows.
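A short sketch of the ONLY semantics, using the p/c0/c1 tables above:
SELECT * FROM p;                  -- scans p plus its children c0 and c1
SELECT * FROM ONLY p;             -- scans just the parent table p
UPDATE ONLY p SET info = NULL;    -- DML with ONLY likewise touches the parent alone
INSERT INTO p VALUES (2,'x','y'); -- INSERT always targets only the named table (no RULE/TRIGGER here)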

61 Partition Partitioned tables (Range, Hash, List, Complex) can be implemented with rules or triggers.
Rules are not fired by COPY, and when no rule matches, the row simply lands in the parent table; there is no easy way to handle that. A rule fires once per query, so rules suit bulk-insert scenarios better. Trigger-based partitioning example:
Parent table:
CREATE TABLE p (
  city_id int not null,
  logtime timestamp(0) not null,
  peaktemp int,
  unitsales int );
Index on the partition key:
CREATE INDEX idx_p_logtime ON p (logtime);
Child tables, one per month:
CREATE TABLE p_201201 (LIKE p INCLUDING all) INHERITS (p);
CREATE TABLE p_201202 (LIKE p INCLUDING all) INHERITS (p);
CREATE TABLE p_201203 (LIKE p INCLUDING all) INHERITS (p);
CREATE TABLE p_201204 (LIKE p INCLUDING all) INHERITS (p);
The remaining child tables, the partition-key constraints, the trigger functions, the trigger definitions, the test data, and the execution plans are shown step by step on the following slides.

62 Partition CREATE TABLE p_201205 (LIKE p INCLUDING all) INHERITS (p);
CREATE TABLE p_201206 (LIKE p INCLUDING all) INHERITS (p);
CREATE TABLE p_201207 (LIKE p INCLUDING all) INHERITS (p);
CREATE TABLE p_201208 (LIKE p INCLUDING all) INHERITS (p);
CREATE TABLE p_201209 (LIKE p INCLUDING all) INHERITS (p);
CREATE TABLE p_201210 (LIKE p INCLUDING all) INHERITS (p);
CREATE TABLE p_201211 (LIKE p INCLUDING all) INHERITS (p);
CREATE TABLE p_201212 (LIKE p INCLUDING all) INHERITS (p);
CREATE TABLE p_default (LIKE p INCLUDING all) INHERITS (p);
Partition-key constraints on the child tables:
ALTER TABLE p_201201 ADD CONSTRAINT p_201201_ck1 CHECK (logtime>=date '2012-01-01' and logtime<date '2012-02-01');
ALTER TABLE p_201202 ADD CONSTRAINT p_201202_ck1 CHECK (logtime>=date '2012-02-01' and logtime<date '2012-03-01');
ALTER TABLE p_201203 ADD CONSTRAINT p_201203_ck1 CHECK (logtime>=date '2012-03-01' and logtime<date '2012-04-01');
ALTER TABLE p_201204 ADD CONSTRAINT p_201204_ck1 CHECK (logtime>=date '2012-04-01' and logtime<date '2012-05-01');

63 Partition ALTER TABLE p_201205 ADD CONSTRAINT p_201205_ck1 CHECK (logtime>=date '2012-05-01' and logtime<date '2012-06-01');
ALTER TABLE p_201206 ADD CONSTRAINT p_201206_ck1 CHECK (logtime>=date '2012-06-01' and logtime<date '2012-07-01');
ALTER TABLE p_201207 ADD CONSTRAINT p_201207_ck1 CHECK (logtime>=date '2012-07-01' and logtime<date '2012-08-01');
ALTER TABLE p_201208 ADD CONSTRAINT p_201208_ck1 CHECK (logtime>=date '2012-08-01' and logtime<date '2012-09-01');
ALTER TABLE p_201209 ADD CONSTRAINT p_201209_ck1 CHECK (logtime>=date '2012-09-01' and logtime<date '2012-10-01');
ALTER TABLE p_201210 ADD CONSTRAINT p_201210_ck1 CHECK (logtime>=date '2012-10-01' and logtime<date '2012-11-01');
ALTER TABLE p_201211 ADD CONSTRAINT p_201211_ck1 CHECK (logtime>=date '2012-11-01' and logtime<date '2012-12-01');
ALTER TABLE p_201212 ADD CONSTRAINT p_201212_ck1 CHECK (logtime>=date '2012-12-01' and logtime<date '2013-01-01');
ALTER TABLE p_default ADD CONSTRAINT p_default_ck1 CHECK (logtime<date '2012-01-01' or logtime>=date '2013-01-01');

64 Partition Child table example: postgres=# \d p_201201
Table "public.p_201201"
 Column    | Type                           | Modifiers
 city_id   | integer                        | not null
 logtime   | timestamp(0) without time zone | not null
 peaktemp  | integer                        |
 unitsales | integer                        |
Indexes:
    "p_201201_logtime_idx" btree (logtime)
Check constraints:
    "p_201201_ck1" CHECK (logtime >= '2012-01-01'::date AND logtime < '2012-02-01'::date)
Inherits: p

65 Partition The INSERT trigger function: CREATE OR REPLACE FUNCTION p_insert_trigger()
RETURNS TRIGGER AS $$
BEGIN
IF ( NEW.logtime >= DATE '2012-01-01' AND NEW.logtime < DATE '2012-02-01' ) THEN
  INSERT INTO p_201201 VALUES (NEW.*);
ELSIF ( NEW.logtime >= DATE '2012-02-01' AND NEW.logtime < DATE '2012-03-01' ) THEN
  INSERT INTO p_201202 VALUES (NEW.*);
ELSIF ( NEW.logtime >= DATE '2012-03-01' AND NEW.logtime < DATE '2012-04-01' ) THEN
  INSERT INTO p_201203 VALUES (NEW.*);
ELSIF ( NEW.logtime >= DATE '2012-04-01' AND NEW.logtime < DATE '2012-05-01' ) THEN
  INSERT INTO p_201204 VALUES (NEW.*);
ELSIF ( NEW.logtime >= DATE '2012-05-01' AND NEW.logtime < DATE '2012-06-01' ) THEN
  INSERT INTO p_201205 VALUES (NEW.*);
ELSIF ( NEW.logtime >= DATE '2012-06-01' AND NEW.logtime < DATE '2012-07-01' ) THEN
  INSERT INTO p_201206 VALUES (NEW.*);
ELSIF ( NEW.logtime >= DATE '2012-07-01' AND NEW.logtime < DATE '2012-08-01' ) THEN
  INSERT INTO p_201207 VALUES (NEW.*);

66 Partition ELSIF ( NEW.logtime >= DATE '2012-08-01' AND NEW.logtime < DATE '2012-09-01' ) THEN
  INSERT INTO p_201208 VALUES (NEW.*);
ELSIF ( NEW.logtime >= DATE '2012-09-01' AND NEW.logtime < DATE '2012-10-01' ) THEN
  INSERT INTO p_201209 VALUES (NEW.*);
ELSIF ( NEW.logtime >= DATE '2012-10-01' AND NEW.logtime < DATE '2012-11-01' ) THEN
  INSERT INTO p_201210 VALUES (NEW.*);
ELSIF ( NEW.logtime >= DATE '2012-11-01' AND NEW.logtime < DATE '2012-12-01' ) THEN
  INSERT INTO p_201211 VALUES (NEW.*);
ELSIF ( NEW.logtime >= DATE '2012-12-01' AND NEW.logtime < DATE '2013-01-01' ) THEN
  INSERT INTO p_201212 VALUES (NEW.*);
ELSIF ( NEW.logtime >= DATE '2013-01-01' OR NEW.logtime < DATE '2012-01-01' ) THEN
  INSERT INTO p_default VALUES (NEW.*);
ELSE
  RAISE EXCEPTION 'Date out of range. Fix the p_insert_trigger() function!';
END IF;
RETURN NULL;
END;
$$ LANGUAGE plpgsql;

67 Partition The DELETE trigger function: CREATE OR REPLACE FUNCTION p_delete_trigger()
RETURNS TRIGGER AS $$
BEGIN
IF ( OLD.logtime >= DATE '2012-01-01' AND OLD.logtime < DATE '2012-02-01' ) THEN
  DELETE FROM p_201201 WHERE logtime=OLD.logtime;
ELSIF ( OLD.logtime >= DATE '2012-02-01' AND OLD.logtime < DATE '2012-03-01' ) THEN
  DELETE FROM p_201202 WHERE logtime=OLD.logtime;
ELSIF ( OLD.logtime >= DATE '2012-03-01' AND OLD.logtime < DATE '2012-04-01' ) THEN
  DELETE FROM p_201203 WHERE logtime=OLD.logtime;
ELSIF ( OLD.logtime >= DATE '2012-04-01' AND OLD.logtime < DATE '2012-05-01' ) THEN
  DELETE FROM p_201204 WHERE logtime=OLD.logtime;
ELSIF ( OLD.logtime >= DATE '2012-05-01' AND OLD.logtime < DATE '2012-06-01' ) THEN
  DELETE FROM p_201205 WHERE logtime=OLD.logtime;
ELSIF ( OLD.logtime >= DATE '2012-06-01' AND OLD.logtime < DATE '2012-07-01' ) THEN
  DELETE FROM p_201206 WHERE logtime=OLD.logtime;
ELSIF ( OLD.logtime >= DATE '2012-07-01' AND OLD.logtime < DATE '2012-08-01' ) THEN
  DELETE FROM p_201207 WHERE logtime=OLD.logtime;

68 Partition ELSIF ( OLD.logtime >= DATE '2012-08-01' AND OLD.logtime < DATE '2012-09-01' ) THEN
  DELETE FROM p_201208 WHERE logtime=OLD.logtime;
ELSIF ( OLD.logtime >= DATE '2012-09-01' AND OLD.logtime < DATE '2012-10-01' ) THEN
  DELETE FROM p_201209 WHERE logtime=OLD.logtime;
ELSIF ( OLD.logtime >= DATE '2012-10-01' AND OLD.logtime < DATE '2012-11-01' ) THEN
  DELETE FROM p_201210 WHERE logtime=OLD.logtime;
ELSIF ( OLD.logtime >= DATE '2012-11-01' AND OLD.logtime < DATE '2012-12-01' ) THEN
  DELETE FROM p_201211 WHERE logtime=OLD.logtime;
ELSIF ( OLD.logtime >= DATE '2012-12-01' AND OLD.logtime < DATE '2013-01-01' ) THEN
  DELETE FROM p_201212 WHERE logtime=OLD.logtime;
ELSIF ( OLD.logtime >= DATE '2013-01-01' OR OLD.logtime < DATE '2012-01-01' ) THEN
  DELETE FROM p_default WHERE logtime=OLD.logtime;
ELSE
  RAISE EXCEPTION 'Date out of range. Fix the p_delete_trigger() function!';
END IF;
RETURN NULL;
END;
$$ LANGUAGE plpgsql;

69 Partition Create the INSERT trigger: CREATE TRIGGER insert_p_trigger BEFORE INSERT ON p
FOR EACH ROW EXECUTE PROCEDURE p_insert_trigger();
Create the DELETE trigger:
CREATE TRIGGER delete_p_trigger BEFORE DELETE ON p FOR EACH ROW EXECUTE PROCEDURE p_delete_trigger();
Insert test data, directed at each partition:
INSERT INTO p (city_id, logtime, peaktemp, unitsales) VALUES (1, timestamp ' :59:59', 20, 10);
INSERT INTO p (city_id, logtime, peaktemp, unitsales) VALUES (2, timestamp ' :59:59', 20, 10);
INSERT INTO p (city_id, logtime, peaktemp, unitsales) VALUES (3, timestamp ' :59:59', 20, 10);
INSERT INTO p (city_id, logtime, peaktemp, unitsales) VALUES (4, timestamp ' :59:59', 20, 10);
INSERT INTO p (city_id, logtime, peaktemp, unitsales) VALUES (5, timestamp ' :59:59', 20, 10);

70 Partition INSERT INTO p (city_id, logtime, peaktemp, unitsales) VALUES (6, timestamp ' :59:59', 20, 10); INSERT INTO p (city_id, logtime, peaktemp, unitsales) VALUES (7, timestamp ' :59:59', 20, 10); INSERT INTO p (city_id, logtime, peaktemp, unitsales) VALUES (8, timestamp ' :59:59', 20, 10); INSERT INTO p (city_id, logtime, peaktemp, unitsales) VALUES (9, timestamp ' :59:59', 20, 10); INSERT INTO p (city_id, logtime, peaktemp, unitsales) VALUES (10, timestamp ' :59:59', 20, 10); INSERT INTO p (city_id, logtime, peaktemp, unitsales) VALUES (11, timestamp ' :59:59', 20, 10); INSERT INTO p (city_id, logtime, peaktemp, unitsales) VALUES (12, timestamp ' :59:59', 20, 10); INSERT INTO p (city_id, logtime, peaktemp, unitsales) VALUES (13, timestamp ' :59:59', 20, 10); INSERT INTO p (city_id, logtime, peaktemp, unitsales) VALUES (14, timestamp ' :59:59', 20, 10);

71 Partition Verify that the rows were routed to the correct child tables
SELECT t1.relname,t2.* FROM p t2,pg_class t1 WHERE t2.tableoid=t1.oid ORDER BY t2.logtime;
 relname   | city_id | logtime | peaktemp | unitsales
 p_default |         | :59:59  |          |
 p_201201  |         | :59:59  |          |
 p_201202  |         | :59:59  |          |
 p_201203  |         | :59:59  |          |
 p_201204  |         | :59:59  |          |
 p_201205  |         | :59:59  |          |
 p_201206  |         | :59:59  |          |
 p_201207  |         | :59:59  |          |
 p_201208  |         | :59:59  |          |
 p_201209  |         | :59:59  |          |
 p_201210  |         | :59:59  |          |
 p_201211  |         | :59:59  |          |
 p_201212  |         | :59:59  |          |
 p_default |         | :59:59  |          |

72 Partition Partitioned-table optimization: constraint_exclusion = partition # on, off, or partition
on: examine constraints for all tables
off: never examine constraints
partition: examine constraints only for inheritance child tables and UNION ALL subqueries

73 Partition UPDATE execution plan
postgres=# EXPLAIN UPDATE p SET unitsales=unitsales+1 WHERE logtime=timestamp ' :59:59';
QUERY PLAN
Update on p (cost= rows=9 width=26)
 -> Seq Scan on p (cost= rows=1 width=26)
    Filter: (logtime = ' :59:59'::timestamp without time zone)
 -> Bitmap Heap Scan on p_default p (cost= rows=8 width=26)
    Recheck Cond: (logtime = ' :59:59'::timestamp without time zone)
    -> Bitmap Index Scan on p_default_logtime_idx (cost= rows=8 width=0)
       Index Cond: (logtime = ' :59:59'::timestamp without time zone)
UPDATE p SET unitsales=unitsales+1 WHERE logtime=timestamp ' :59:59';
 relname   | city_id | logtime | peaktemp | unitsales
 p_default |         | :59:59  |          |

74 Partition DELETE execution plan
postgres=# EXPLAIN DELETE FROM p WHERE logtime=timestamp ' :59:59';
QUERY PLAN
Delete on p (cost= rows=9 width=6)
 -> Seq Scan on p (cost= rows=1 width=6)
 -> Bitmap Heap Scan on p_default p (cost= rows=8 width=6)
DELETE FROM p WHERE logtime=timestamp ' :59:59';
DELETE 1

75 Partition SELECT execution plan
postgres=# EXPLAIN SELECT * FROM p WHERE logtime=timestamp ' :59:59';
QUERY PLAN
Result (cost= rows=9 width=20)
 -> Append (cost= rows=9 width=20)
    -> Seq Scan on p (cost= rows=1 width=20)
       Filter: (logtime = ' :59:59'::timestamp without time zone)
    -> Bitmap Heap Scan on p_default p (cost= rows=8 width=20)
       Recheck Cond: (logtime = ' :59:59'::timestamp without time zone)
       -> Bitmap Index Scan on p_default_logtime_idx (cost= rows=8 width=0)
          Index Cond: (logtime = ' :59:59'::timestamp without time zone)

76 Partition When the partition key is used as a WHERE filter, the execution plan differs depending on whether the comparison value is a function call or a constant.
postgres=# select proname,provolatile,proargtypes from pg_proc where prorettype in (select oid from pg_type where typname ~ 'timestamp') order by proargtypes;
 proname                              | provolatile | proargtypes
 transaction_timestamp                | s           |
 statement_timestamp                  | s           |
 pg_stat_get_bgwriter_stat_reset_time | s           |
 pg_conf_load_time                    | s           |
 pg_postmaster_start_time             | s           |
 pg_last_xact_replay_timestamp        | v           |
 clock_timestamp                      | v           |
 now                                  | s           |
postgres=# show constraint_exclusion;
 constraint_exclusion
----------------------
 partition

77 Partition postgres=# EXPLAIN SELECT * FROM p WHERE logtime=now();
QUERY PLAN Result (cost= rows=105 width=20) -> Append (cost= rows=105 width=20) -> Seq Scan on p (cost= rows=1 width=20) Filter: (logtime = now()) -> Bitmap Heap Scan on p_ p (cost= rows=8 width=20) Recheck Cond: (logtime = now()) -> Bitmap Index Scan on p_201201_logtime_idx (cost= rows=8 width=0) Index Cond: (logtime = now()) -> Bitmap Heap Scan on p_ p (cost= rows=8 width=20) -> Bitmap Index Scan on p_201202_logtime_idx (cost= rows=8 width=0) -> Bitmap Heap Scan on p_ p (cost= rows=8 width=20) -> Bitmap Index Scan on p_201203_logtime_idx (cost= rows=8 width=0)

78 Partition Index Cond: (logtime = now())
-> Bitmap Heap Scan on p_ p (cost= rows=8 width=20) Recheck Cond: (logtime = now()) -> Bitmap Index Scan on p_201204_logtime_idx (cost= rows=8 width=0) -> Bitmap Heap Scan on p_ p (cost= rows=8 width=20) -> Bitmap Index Scan on p_201205_logtime_idx (cost= rows=8 width=0) -> Bitmap Heap Scan on p_ p (cost= rows=8 width=20) -> Bitmap Index Scan on p_201206_logtime_idx (cost= rows=8 width=0) -> Bitmap Heap Scan on p_ p (cost= rows=8 width=20) -> Bitmap Index Scan on p_201207_logtime_idx (cost= rows=8 width=0) -> Bitmap Heap Scan on p_ p (cost= rows=8 width=20)

79 Partition Recheck Cond: (logtime = now())
-> Bitmap Index Scan on p_201208_logtime_idx (cost= rows=8 width=0) Index Cond: (logtime = now()) -> Bitmap Heap Scan on p_ p (cost= rows=8 width=20) -> Bitmap Index Scan on p_201209_logtime_idx (cost= rows=8 width=0) -> Bitmap Heap Scan on p_ p (cost= rows=8 width=20) -> Bitmap Index Scan on p_201210_logtime_idx (cost= rows=8 width=0) -> Bitmap Heap Scan on p_ p (cost= rows=8 width=20) -> Bitmap Index Scan on p_201211_logtime_idx (cost= rows=8 width=0) -> Bitmap Heap Scan on p_ p (cost= rows=8 width=20) -> Bitmap Index Scan on p_201212_logtime_idx (cost= rows=8 width=0)

80 Partition Index Cond: (logtime = now())
-> Bitmap Heap Scan on p_default p (cost= rows=8 width=20) Recheck Cond: (logtime = now()) -> Bitmap Index Scan on p_default_logtime_idx (cost= rows=8 width=0) Change the function's volatility: postgres=# ALTER FUNCTION now() IMMUTABLE; ALTER FUNCTION postgres=# EXPLAIN SELECT * FROM p WHERE logtime=now(); -- same plan as above postgres=# ALTER FUNCTION now() VOLATILE; QUERY PLAN Result (cost= rows=105 width=20)

81 Partition -> Append (cost=0.00..447.85 rows=105 width=20)
-> Seq Scan on p (cost= rows=1 width=20) Filter: (logtime = now()) -> Seq Scan on p_ p (cost= rows=8 width=20) -> Seq Scan on p_ p (cost= rows=8 width=20) -> Seq Scan on p_ p (cost= rows=8 width=20) -> Seq Scan on p_ p (cost= rows=8 width=20) -> Seq Scan on p_ p (cost= rows=8 width=20) -> Seq Scan on p_ p (cost= rows=8 width=20) -> Seq Scan on p_ p (cost= rows=8 width=20) -> Seq Scan on p_ p (cost= rows=8 width=20)

82 Partition Filter: (logtime = now())
-> Seq Scan on p_ p (cost= rows=8 width=20) -> Seq Scan on p_ p (cost= rows=8 width=20) -> Seq Scan on p_ p (cost= rows=8 width=20) -> Seq Scan on p_ p (cost= rows=8 width=20) -> Seq Scan on p_default p (cost= rows=8 width=20) (30 rows) Function volatility: see Thinking PostgreSQL Function's Volatility Categories

83 Partition Notes on using partitioned tables: where possible, move the partitioning logic into the application code. Set constraint_exclusion = partition.
When the partition key appears as a filter in the WHERE clause, pass a constant to get a good execution plan. Keep the partition rules simple; use plain b-tree indexes on the partition key and avoid functional indexes. Potential costs of in-database partitioning: CPU overhead (triggers or rules, hard parsing). See: PostgreSQL partition table's arithmetic tuning example

84 Foreign Data Foreign data wrapper
A foreign data wrapper is a library that can communicate with an external data source, hiding the details of connecting to the data source and fetching data from it. There is a foreign data wrapper available as a contrib module, which can read plain data files residing on the server. Other kinds of foreign data wrappers can be found as third-party products.
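As a sketch of that contrib module, file_fdw (shipped with 9.1) can expose a server-side flat file as a table; the server name and file here are illustrative:
CREATE EXTENSION file_fdw;
CREATE SERVER file_srv FOREIGN DATA WRAPPER file_fdw;
CREATE FOREIGN TABLE passwd (
  username text, pass text, uid int, gid int,
  gecos text, home text, shell text
) SERVER file_srv
OPTIONS (filename '/etc/passwd', format 'csv', delimiter ':');
SELECT username, shell FROM passwd LIMIT 3;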

85 Foreign Data (Diagram: Foreign Table(s) hold the table definition; User Mapping(s) hold the auth info; Server(s) hold the connection info; the FDW library (File, Oracle, MySQL, PostgreSQL, Redis, others) implements the API that reaches the External Data Source. The file FDW does not need the auth and connection pieces.)

86 Foreign Data PostgreSQL Foreign Table - pgsql_fdw
PostgreSQL Foreign Table - oracle_fdw 1 PostgreSQL Foreign Table - oracle_fdw 2 PostgreSQL Foreign Table - oracle_fdw 3 PostgreSQL Foreign Table - file_fdw PostgreSQL Foreign Table - redis_fdw PostgreSQL Foreign Table - mysql_fdw 1 PostgreSQL Foreign Table - mysql_fdw 2

87 DML INSERT UPDATE DELETE A single transaction can contain at most 2^32 commands (because cmin/cmax are 4 bytes long).
A PostgreSQL transaction may mix DML, DDL and DCL, with a few exceptions: CREATE TABLESPACE, CREATE DATABASE, CREATE INDEX CONCURRENTLY, and some others. (Oracle automatically commits the pending transaction before executing DDL, so Oracle does not support DDL inside a transaction.) A minimal sketch follows.
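A minimal sketch of transactional DDL (the table name is made up):
BEGIN;
CREATE TABLE tx_ddl (id int);   -- DDL inside the transaction
INSERT INTO tx_ddl VALUES (1);  -- DML in the same transaction
ROLLBACK;                       -- both the table and the row vanish
SELECT * FROM tx_ddl;           -- ERROR: relation "tx_ddl" does not exist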

88 Query JOIN ALIAS Table as Function's Return data type
GROUP BY [ HAVING ] DISTINCT COMBINING QUERY SORT LIMIT [ OFFSET ] WITH

89 JOIN T1 CROSS JOIN T2 (equivalent to T1 INNER JOIN T2 ON TRUE)
T1 { [INNER] | { LEFT | RIGHT | FULL } [OUTER] } JOIN T2 ON boolean_expression
T1 { [INNER] | { LEFT | RIGHT | FULL } [OUTER] } JOIN T2 USING ( join column list )
T1 NATURAL { [INNER] | { LEFT | RIGHT | FULL } [OUTER] } JOIN T2
The words INNER and OUTER are optional in all forms. INNER is the default; LEFT, RIGHT, and FULL imply an outer join.

90 JOIN CROSS JOIN produces a Cartesian product. When the two joined tables have two columns with the same name, type, and length, the following three spellings are equivalent (see the sketch below).
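A sketch of the three equivalent spellings, assuming tables t1 and t2 that share columns a and b (note that USING and NATURAL merge the common columns in the output):
select * from t1 join t2 on t1.a = t2.a and t1.b = t2.b;
select * from t1 join t2 using (a, b);
select * from t1 natural join t2;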

91 JOIN A left or right join fills the columns of the non-matching right or left rows with NULLs. A full join fills the non-matching rows on both sides with NULLs.

92 ALIAS table alias: FROM table_reference AS alias
FROM table_reference alias; column alias: SELECT expression [ [ AS ] output_name ]; subquery alias: FROM (SELECT * FROM table1) AS alias_name

93 Table as Function's Return data type
return table's row type
create table t1 (id int, name text, crt_time timestamp(0));
create or replace function f_t1 (i_id int) returns setof t1 as $$
declare
begin
  return query select * from t1 where id=i_id;
  return;
end;
$$ language plpgsql;
insert into t1 values(1,'digoal',now());
insert into t1 values(1,'DIGOAL',now());
select * from f_t1(1);
 id | name   | crt_time
  1 | digoal | :15:09
  1 | DIGOAL | :15:15

94 Table as Function's Return data type
return composite type
create type type1 as (id int, name text, crt_time timestamp(0));
create or replace function f_type1 (i_id int) returns setof type1 as $$
declare
begin
  return query select * from t1 where id=i_id;
  return;
end;
$$ language plpgsql;
select * from f_type1(1);
 id | name   | crt_time
  1 | digoal | :15:09
  1 | DIGOAL | :15:15

95 Table as Function's Return data type
return record
create or replace function f_record1 (i_id int) returns setof record as $$
declare
begin
  return query select * from t1 where id=i_id;
  return;
end;
$$ language plpgsql;
select * from f_record1(1) as (id int, name text, crt_time timestamp(0));
 id | name   | crt_time
  1 | digoal | :15:09
  1 | DIGOAL | :15:15

96 DISTINCT SELECT DISTINCT select_list ... (NULLs are treated as equal by DISTINCT [ON])
postgres=# select * from t1 ;
 id | name   | crt_time
  1 | digoal | :15:09
  1 | DIGOAL | :15:15
postgres=# select distinct id from t1;
 id
  1
SELECT DISTINCT ON (expression [, expression ...]) select_list ...
Here expression is an arbitrary value expression that is evaluated for all rows. A set of rows for which all the expressions are equal are considered duplicates, and only the first row of the set is kept in the output. Note that the "first row" of a set is unpredictable unless the query is sorted on enough columns to guarantee a unique ordering of the rows arriving at the DISTINCT filter. (DISTINCT ON processing occurs after ORDER BY sorting.)

97 DISTINCT postgres=# select distinct on (id) id,name,crt_time from t1 ;
  1 | digoal | :15:09
postgres=# select distinct on (id) id,name,crt_time from t1 order by crt_time;
ERROR:  SELECT DISTINCT ON expressions must match initial ORDER BY expressions
LINE 1: select distinct on (id) id,name,crt_time from t1 order by cr...
                                ^
postgres=# select distinct on (id) id,name,crt_time from t1 order by id;

98 DISTINCT postgres=# select distinct on (id) id,name,crt_time from t1 order by id,crt_time;
 id | name   | crt_time
  1 | digoal | :15:09
postgres=# select distinct on (id) id,name,crt_time from t1 order by id,crt_time desc;
  1 | DIGOAL | :15:15

99 DISTINCT postgres=# select distinct on (id,name) id,name,crt_time from t1 order by id,crt_time desc;
ERROR:  SELECT DISTINCT ON expressions must match initial ORDER BY expressions
LINE 1: select distinct on (id,name) id,name,crt_time from t1 order ...
                                ^
postgres=# select distinct on (id,name) id,name,crt_time from t1 order by id,name,crt_time desc;
 id | name   | crt_time
  1 | DIGOAL | :15:15
  1 | digoal | :15:09

100 DISTINCT Using DISTINCT ON to get the per-group winner that an earlier section implemented with a window function
postgres=# CREATE TABLE window_test(id int, name text, subject text, score numeric);
postgres=# INSERT INTO window_test VALUES
  (1,'digoal','数学',99.5), (2,'digoal','语文',89.5), (3,'digoal','英语',79.5), (4,'digoal','物理',99.5), (5,'digoal','化学',98.5),
  (6,'刘德华','数学',89.5), (7,'刘德华','语文',99.5), (8,'刘德华','英语',79.5), (9,'刘德华','物理',89.5), (10,'刘德华','化学',69.5),
  (11,'张学友','数学',89.5), (12,'张学友','语文',91.5), (13,'张学友','英语',92.5), (14,'张学友','物理',93.5), (15,'张学友','化学',94.5);
-- fetch the top scorer of each subject
postgres=# select distinct on (subject) id,name,subject,score from window_test order by subject,score desc;

101 DISTINCT
 id | name   | subject | score
----+--------+---------+-------
  5 | digoal | 化学    |  98.5
  1 | digoal | 数学    |  99.5
  4 | digoal | 物理    |  99.5
 13 | 张学友 | 英语    |  92.5
  7 | 刘德华 | 语文    |  99.5
(5 rows)
The result matches the window-function version, and the query is more concise.

102 NULL IN ORDER BY and DISTINCT
ORDER BY and DISTINCT treat NULLs as equal:
postgres=# select 1 where null is null;
 ?column?
        1
(1 row)
postgres=# select 1 where null = null;
 ?column?
(0 rows)
postgres=# select 1 where null <> null;
 ?column?
(0 rows)
postgres=# select distinct name from t1;
 name
------

103 NULL IN ORDER BY and DISTINCT
postgres=# select * from t1 order by name,id;
postgres=# select * from t1 order by crt_time,id;
 id | name | crt_time
  1 |      | :29:23
  2 |      | :29:26
  3 |      | :29:28
  4 |      | :29:32
  5 |      | :30:04

104 COMBINING QUERY query1 UNION [ALL] query2
query1 INTERSECT [ALL] query2 query1 EXCEPT [ALL] query2 UNION effectively appends the result of query2 to the result of query1 (although there is no guarantee that this is the order in which the rows are actually returned) INTERSECT returns all rows that are both in the result of query1 and in the result of query2. EXCEPT returns all rows that are in the result of query1 but not in the result of query2. Combining Query eliminates duplicate rows from its result, in the same way as DISTINCT, unless ALL is used query1 and query2 must return the same number of columns and the corresponding columns have compatible data types.
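A quick sketch of the duplicate-elimination difference:
postgres=# select 1 union select 1;
 ?column?
        1
(1 row)
postgres=# select 1 union all select 1;
 ?column?
        1
        1
(2 rows)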

105 SORT SELECT select_list FROM table_expression
ORDER BY sort_expression1 [ASC | DESC] [NULLS { FIRST | LAST }] [, sort_expression2 [ASC | DESC] [NULLS { FIRST | LAST }] ...]

106 LIMIT [ OFFSET ] SELECT select_list FROM table_expression
[ ORDER BY ... ] [ LIMIT { number | ALL } ] [ OFFSET number ]

107 WITH(Common Table Expressions)
WITH regional_sales AS (
  SELECT region, SUM(amount) AS total_sales FROM orders GROUP BY region
), top_regions AS (
  SELECT region FROM regional_sales
  WHERE total_sales > (SELECT SUM(total_sales)/10 FROM regional_sales)
)
SELECT region, product, SUM(quantity) AS product_units, SUM(amount) AS product_sales
FROM orders
WHERE region IN (SELECT region FROM top_regions)
GROUP BY region, product;

108 WITH(Common Table Expressions)
WITH RECURSIVE t(n) AS ( VALUES (1) UNION ALL SELECT n+1 FROM t WHERE n < 100 ) SELECT sum(n) FROM t;
Non-recursive term, UNION [ALL], recursive term. A temp working table holds the output of the WITH statement; a LIMIT in the outer query can break out of the loop of a "recursive" SQL.

109 WITH(Common Table Expressions)
UNION removes duplicates (NULLs compare equal for this purpose); every output in the figure passes through the UNION [ALL] step, covering previously returned rows as well as the current rows.
1. Evaluate the non-recursive term and output its rows. 2. Put them into the temp working table (TWT). 3. If the TWT is empty, recursion ends; otherwise evaluate the recursive term reading from the TWT and output its rows. 4. Those rows also go into a temp intermediate table. 5. The intermediate table's contents replace the TWT and the intermediate table is emptied; loop back to step 3.

110 WITH(Common Table Expressions)
The temp working table has no ctid, cmin, cmax, xmin, xmax, or tableoid columns:
postgres=# create table test (id int,name text);
postgres=# insert into test values (1,'digoal1'),(2,'digoal2');
postgres=# begin;
postgres=# with t1 as (update test set name='DIGOAL2' where id=2 returning *) select ctid from t1;
ERROR:  column "ctid" does not exist
LINE 2: select ctid from t1;
               ^
postgres=# rollback;
The other system columns (cmin, cmax, xmin, xmax, tableoid) raise the same error.

111 WITH(Common Table Expressions)
A recursive query that outputs the product department ('产品部') and all of its sub-departments, then the headcount of each sub-department:
WITH RECURSIVE included_parts(sub_part, part, quantity) AS (
  SELECT sub_part, part, quantity FROM parts WHERE part = '产品部'
  UNION ALL
  SELECT p.sub_part, p.part, p.quantity FROM included_parts pr, parts p WHERE p.part = pr.sub_part
)
SELECT sub_part, SUM(quantity) as total_quantity FROM included_parts GROUP BY sub_part
The first iteration reads the initial output; all later iterations read the temp working table.

112 WITH(Common Table Expressions)
An infinite loop:
WITH RECURSIVE search_graph(id, link, data, depth) AS (
  SELECT g.id, g.link, g.data, 1 FROM graph g
  UNION ALL
  SELECT g.id, g.link, g.data, sg.depth + 1 FROM graph g, search_graph sg WHERE g.id = sg.link
)
SELECT * FROM search_graph;
Every iteration outputs rows that differ from all earlier rows, so the temp working table is never empty and the loop runs forever.

113 WITH(Common Table Expressions)
How to avoid the infinite loop above: give the recursive term a chance to output no rows.
WITH RECURSIVE search_graph(id, link, data, depth, path, cycle) AS (
  SELECT g.id, g.link, g.data, 1, ARRAY[g.id], false FROM graph g
  UNION ALL
  SELECT g.id, g.link, g.data, sg.depth + 1, path || g.id, g.id = ANY(path)
  FROM graph g, search_graph sg WHERE g.id = sg.link AND NOT cycle
)
SELECT * FROM search_graph;

114 WITH(Common Table Expressions)
To compare multiple columns, use an ARRAY of ROW values.
WITH RECURSIVE search_graph(id, link, data, depth, path, cycle) AS (
  SELECT g.id, g.link, g.data, 1, ARRAY[ROW(g.f1, g.f2)], false FROM graph g
  UNION ALL
  SELECT g.id, g.link, g.data, sg.depth + 1, path || ROW(g.f1, g.f2), ROW(g.f1, g.f2) = ANY(path)
  FROM graph g, search_graph sg WHERE g.id = sg.link AND NOT cycle
)
SELECT * FROM search_graph;

115 WITH(Common Table Expressions)
Another way to break out of the loop:
WITH RECURSIVE t(n) AS ( SELECT 1 UNION ALL SELECT n+1 FROM t ) SELECT n FROM t LIMIT 100;
Note: if t is joined to other tables in the outer query and only then limited, the query can still loop forever. Guard against infinite loops whenever you write recursive queries.

116 WITH(Common Table Expressions)
Delete the rows that belong to the product department ('产品部') and its sub-departments:
WITH RECURSIVE included_parts(sub_part, part) AS (
  SELECT sub_part, part FROM parts WHERE part = '产品部'
  UNION ALL
  SELECT p.sub_part, p.part FROM included_parts pr, parts p WHERE p.part = pr.sub_part
)
DELETE FROM parts WHERE part IN (SELECT part FROM included_parts);

117 WITH(Common Table Expressions)
All sub-statements of a WITH, including the main statement, see the same SNAPSHOT. They cannot see each other's changes to rows; to see changed rows, a RETURNING clause must be used.
WITH t AS ( UPDATE products SET price = price * 1.05 WHERE id = 10 RETURNING * )
SELECT * FROM products WHERE id = 10;   -- sees the old price
SELECT * FROM t;                        -- sees the new price via RETURNING

118 WITH(Common Table Expressions)
Test table:
postgres=# create table test (id int,name text);
CREATE TABLE
postgres=# insert into test values(1,'digoal1'),(2,'digoal2');
This sees the old data:
postgres=# with t1 as (update test set name='NEW' where id=2 returning *)
postgres-# select * from test where name='NEW';
 id | name
(0 rows)
This sees the new data:
postgres=# with t1 as (update test set name='NEWNEW' where id=2 returning *)
postgres-# select * from t1 where name='NEWNEW';
 id | name
  2 | NEWNEW
(1 row)

119 WITH(Common Table Expressions)
Avoid having a WITH sub-statement and the main statement modify the same row; which one executes first is unpredictable. Avoid SQL like:
postgres=# create table test (id int,name text);
postgres=# insert into test values (1,'digoal1'),(2,'digoal2');
postgres=# with t1 as (delete from test where id=1)
postgres-# update test set name='DIGOAL1' where id=1;
UPDATE 1
postgres=# select * from test where id=1;
 id | name
  1 | DIGOAL1
The sub-statements in WITH are executed concurrently with each other and with the main query. Therefore, when using data-modifying statements in WITH, the order in which the specified updates actually happen is unpredictable. All the statements are executed with the same snapshot, so they cannot "see" each others' effects on the target tables. This alleviates the effects of the unpredictability of the actual order of row updates, and means that RETURNING data is the only way to communicate changes between different WITH sub-statements and the main query. Trying to update the same row twice in a single statement is not supported. Only one of the modifications takes place, but it is not easy (and sometimes not possible) to reliably predict which one. This also applies to deleting a row that was already updated in the same statement: only the update is performed. Therefore you should generally avoid trying to modify a single row twice in a single statement. In particular avoid writing WITH sub-statements that could affect the same rows changed by the main statement or a sibling sub-statement. The effects of such a statement will not be predictable. At present, any table used as the target of a data-modifying statement in WITH must not have a conditional rule, nor an ALSO rule, nor an INSTEAD rule that expands to multiple statements.

120 WITH(Common Table Expressions)
Avoid SQL like:
postgres=# with t1 as (update test set name='digoal1' where id=1)
postgres-# delete from test where id=1;
DELETE 1
postgres=# select * from test where id=1;
 id | name
(0 rows)
postgres=# with t1 as (update test set name='DIGOAL2' where id=2)
postgres-# update test set name='NEW' WHERE id=2;
UPDATE 1
postgres=# select * from test where id=2;
 id | name
  2 | NEW
This test disagrees with the manual, which says that when an UPDATE and a DELETE sub-statement hit the same row, the DELETE is not performed (only the update takes place).

121 Data Type Strongly typed

122 Data Type Storage methods

123 Data Type Common data types: numeric
 Name             | Storage Size | Description                     | Range
 smallint         | 2 bytes      | small-range integer             | -32768 to +32767
 integer          | 4 bytes      | typical choice for integer      | -2147483648 to +2147483647
 bigint           | 8 bytes      | large-range integer             | -9223372036854775808 to +9223372036854775807
 decimal/numeric  | variable     | user-specified precision, exact | up to 131072 digits before the decimal point; up to 16383 digits after the decimal point
 real             | 4 bytes      | variable-precision, inexact     | 6 decimal digits precision
 double precision | 8 bytes      | variable-precision, inexact     | 15 decimal digits precision
 serial           | 4 bytes      | autoincrementing integer        | 1 to 2147483647
 bigserial        | 8 bytes      | large autoincrementing integer  | 1 to 9223372036854775807
numeric has a special value 'NaN'. PostgreSQL adjusts the IEEE754 definition of NaN: IEEE754 specifies that NaN should not compare equal to any other floating-point value (including NaN). In order to allow floating-point values to be sorted and used in tree-based indexes, PostgreSQL treats NaN values as equal, and greater than all non-NaN values. In addition to ordinary numeric values, the floating-point types have three special values: Infinity, -Infinity, NaN.
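A quick check of this ordering (NaN equals NaN, and NaN sorts above Infinity):
postgres=# select 'NaN'::float8 = 'NaN'::float8, 'NaN'::float8 > 'Infinity'::float8;
 ?column? | ?column?
 t        | t
(1 row)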

124 Data Type Common data types: character
postgres=# create table char_test (c1 char(1),c2 "char");
postgres=# insert into char_test values('a','a'), ('数','数');
postgres=# select *,octet_length(c1),octet_length(c2) from char_test ;
 c1 | c2 | octet_length | octet_length
 a  | a  |              |
 数 |    |              |
 Name                             | Storage Size | Description
 character varying(n), varchar(n) | variable     | variable-length with limit (can store n chars)
 character(n), char(n)            | n chars      | fixed-length, blank padded
 text                             | variable     | variable unlimited length
 "char"                           | 1 byte       | single-byte internal type
 name                             | 64 bytes     | internal type for object names
The name type exists only for the storage of identifiers in the internal system catalogs and is not intended for use by the general user. Its length is currently defined as 64 bytes (63 usable characters plus terminator) but should be referenced using the constant NAMEDATALEN in C source code. The length is set at compile time (and is therefore adjustable for special uses); the default maximum length might change in a future release.

125 Data Type Common data types: date/time
 Name                                    | Storage  | Description                        | Low Value     | High Value    | Resolution
 timestamp [ (p) ] [ without time zone ] | 8 bytes  | both date and time (no time zone)  | 4713 BC       | 294276 AD     | 1 microsecond / 14 digits
 timestamp [ (p) ] with time zone        | 8 bytes  | both date and time, with time zone | 4713 BC       | 294276 AD     | 1 microsecond / 14 digits
 date                                    | 4 bytes  | date (no time of day)              | 4713 BC       | 5874897 AD    | 1 day
 time [ (p) ] [ without time zone ]      | 8 bytes  | time of day (no date)              | 00:00:00      | 24:00:00      | 1 microsecond / 14 digits
 time [ (p) ] with time zone             | 12 bytes | times of day only, with time zone  | 00:00:00+1459 | 24:00:00-1459 | 1 microsecond / 14 digits
 interval [ fields ] [ (p) ]             | 12 bytes | time interval                      | -178000000 years | 178000000 years | 1 microsecond / 14 digits
time, timestamp, and interval accept an optional precision value p which specifies the number of fractional digits retained in the seconds field. By default, there is no explicit bound on precision. The allowed range of p is from 0 to 6 for the timestamp and interval types.

126 Data Type Common data types: date/time. Special date/time inputs:
postgres=# select timestamp 'epoch',date 'infinity',time 'now',date 'today',time 'allballs';
 timestamp | date     | time    | date | time
  :00:00   | infinity | 15:14:  |      | 00:00:00

127 Data Type Common data types: date/time. Input/output format:
postgres=# set datestyle='SQL,DMY';
postgres=# select now();
 27/04/ :49: CST
postgres=# set datestyle='SQL,MDY';
 04/27/ :50: CST
DateStyle (string): sets the display format for date and time values, as well as the rules for interpreting ambiguous date input values. For historical reasons, this variable contains two independent components: the output format specification (ISO, Postgres, SQL, or German) and the input/output specification for year/month/day ordering (DMY, MDY, or YMD). These can be set separately or together. The keywords Euro and European are synonyms for DMY; the keywords US, NonEuro, and NonEuropean are synonyms for MDY. See Section 8.5 for more information. The built-in default is ISO, MDY, but initdb will initialize the configuration file with a setting that corresponds to the behavior of the chosen lc_time locale.

128 Data Type Common data types: date/time. Interval input formats:
quantity unit [quantity unit...] [direction]
P quantity unit [ quantity unit ...] [ T [ quantity unit ...]]
P [ years-months-days ] [ T hours:minutes:seconds ]
IntervalStyle settings:
postgres=# show IntervalStyle ;
 postgres
postgres=# select interval 'P-1Y-2M3DT-4H-5M-6S';
 -1 years -2 mons +3 days -04:05:06
postgres=# select interval '1 day ago';
 -1 days
postgres=# set IntervalStyle ='sql_standard';
 :05:06

129 Data Type Common data types: boolean
TRUE: 't' 'true' 'y' 'yes' 'on' '1'
FALSE: 'f' 'false' 'n' 'no' 'off' '0'
unknown: NULL
 Name    | Storage Size | Description
 boolean | 1 byte       | state of true or false

130 Data Type Common data types: enum. CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy');
CREATE TABLE person (name text,current_mood mood);
INSERT INTO person VALUES ('Moe', 'happy');
SELECT * FROM person WHERE current_mood = 'happy';
 name | current_mood
 Moe  | happy
(1 row)
-- entering a value that is not in the enum raises an error
postgres=# SELECT * FROM person WHERE current_mood = 'happ';
ERROR:  invalid input value for enum mood: "happ"
-- to avoid the error, cast the enum to text
postgres=# SELECT * FROM person WHERE current_mood::text = 'happ';
(0 rows)

131 Data Type Each enum value occupies 4 bytes per row:
postgres=# select current_mood,pg_column_size(current_mood) from person;
 current_mood | pg_column_size
 happy        |              4
The maximum length of an enum label is determined by NAMEDATALEN, by default 64-1, as described earlier. Inspecting the enum's catalog entries:
postgres=# select oid,typname from pg_type where typname='mood';
  oid | typname
      | mood
postgres=# select * from pg_enum where enumtypid= ;
 enumtypid | enumsortorder | enumlabel
           |               | sad
           |               | ok
           |               | happy

132 Data Type Changing an enum type: ALTER TYPE name ADD VALUE new_enum_value [ { BEFORE | AFTER } existing_enum_value ]. This form adds a new value to an enum type. If the new value's place in the enum's ordering is not specified using BEFORE or AFTER, then the new item is placed at the end of the list of values. Note: when adding enum values, avoid changing the positions of existing elements, i.e. append new values at the end whenever possible, or there can be a performance cost. ALTER TYPE ... ADD VALUE (the form that adds a new value to an enum type) cannot be executed inside a transaction block. Comparisons involving an added enum value will sometimes be slower than comparisons involving only original members of the enum type. This will usually only occur if BEFORE or AFTER is used to set the new value's sort position somewhere other than at the end of the list. However, sometimes it will happen even though the new value is added at the end (this occurs if the OID counter "wrapped around" since the original creation of the enum type). The slowdown is usually insignificant; but if it matters, optimal performance can be regained by dropping and recreating the enum type, or by dumping and reloading the database.
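A minimal sketch, reusing the mood type from above ('excited' is an illustrative label):
postgres=# ALTER TYPE mood ADD VALUE 'excited' AFTER 'happy';
postgres=# select enum_range(null::mood);
       enum_range
 {sad,ok,happy,excited}
(1 row)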

133 Data Type The money type: its display depends on the client parameter lc_monetary.
postgres=# show lc_monetary;
 C
postgres=# SELECT '12.345'::money;
 $12.35
postgres=# set lc_monetary='zh_CN';
 ¥12.35

134 Data Type The bytea type allows storage of binary strings.
A binary string is a sequence of octets (or bytes). Differences between bytea and the character types: binary strings specifically allow storing octets of value zero and other "non-printable" octets. Character strings disallow zero octets, and also disallow any other octet values and sequences of octet values that are invalid according to the database's selected character set encoding. Second, operations on binary strings process the actual bytes, whereas the processing of character strings depends on locale settings. In short, binary strings are appropriate for storing data that the programmer thinks of as "raw bytes", whereas character strings are appropriate for storing text.

135 Data Type The bytea type supports two input formats: escape
select E'\\336\\255\\276\\357'::bytea;
and hex, where each pair of hex digits represents one "raw byte":
SELECT E'\\x DE AD BE EF'::bytea;
It also supports two output formats, chosen via the bytea_output parameter: hex output was introduced in 9.0; before 9.0 the output was escape. When migrating from an older version to 9.0 or later, note that applications may become incompatible again; setting bytea_output to escape restores the old behavior. The hex format is recommended for input and output. The "escape" format is the traditional PostgreSQL format for the bytea type. It takes the approach of representing a binary string as a sequence of ASCII characters, while converting those bytes that cannot be represented as an ASCII character into special escape sequences. If, from the point of view of the application, representing bytes as characters makes sense, then this representation can be convenient. But in practice it is usually confusing because it fuzzes up the distinction between binary strings and character strings, and also the particular escape mechanism that was chosen is somewhat unwieldy. So this format should probably be avoided for most new applications.
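A quick check of the two output formats (the escape input above spells the bytes de ad be ef):
postgres=# SET bytea_output = 'hex';
postgres=# SELECT E'\\336\\255\\276\\357'::bytea;
   bytea
 \xdeadbeef
(1 row)
postgres=# SET bytea_output = 'escape';
postgres=# SELECT E'\\336\\255\\276\\357'::bytea;
      bytea
 \336\255\276\357
(1 row)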

136 Data Type Geometric types

137 Data Type Network Address Types

138 Data Type Filling out an IP range:
 Table "digoal.tbl_ip_info"
 Column   | Type                  | Comment
 id       | integer               |
 province | character varying(10) | province
 start_ip | inet                  | starting IP
 end_ip   | inet                  | ending IP
digoal=> insert into tbl_ip_info values (1,'浙江',' ',' ');
digoal=> insert into tbl_ip_info values (2,'广东',' ',' ');
digoal=> insert into tbl_ip_info values (3,'湖南',' ',' ');

139 Data Type digoal=> select id,generate_series(0,end_ip-start_ip)+start_ip from tbl_ip_info ;
 id | ?column?
  ...one output row per IP address in each range...
(24 rows)

140 Data Type Bit String Type
Bit strings are strings of 1's and 0's. They can be used to store or visualize bit masks. There are two SQL bit types: bit(n) and bit varying(n), where n is a positive integer.
CREATE TABLE test (a BIT(3), b BIT VARYING(5));
INSERT INTO test VALUES (B'101', B'00');
INSERT INTO test VALUES (B'10', B'101');
ERROR:  bit string length 2 does not match type bit(3)
INSERT INTO test VALUES (B'10'::bit(3), B'101');
SELECT * FROM test;
  a  |  b
 101 | 00
 100 | 101

141 Data Type Full text search types: tsvector stores the document's lexemes, deduplicated and sorted, and can also store position and weight information; tsquery
stores the lexemes to search for, and can store weight information.

142 Data Type Full text search types: the label * specifies prefix matching

143 Data Type uuid
UUIDs could be generated by client applications or other libraries invoked through a server-side function.
Output format: a group of 8 digits followed by three groups of 4 digits followed by a group of 12 digits, for a total of 32 digits representing the 128 bits.
Input formats: the output format is accepted, along with common alternatives.

144 Data Type xml. Use of this data type requires the installation to have been built with configure --with-libxml. Syntax for constructing xml values. SQL-standard form: XMLPARSE ( { DOCUMENT | CONTENT } value), for example XMLPARSE (DOCUMENT '<?xml version="1.0"?><book><title>Manual</title><chapter>...</chapter></book>') or XMLPARSE (CONTENT 'abc<foo>bar</foo><bar>foo</bar>'). PostgreSQL form: xml '<foo>bar</foo>' or '<foo>bar</foo>'::xml. Converting xml back to a character string: XMLSERIALIZE ( { DOCUMENT | CONTENT } value AS type ), e.g. XMLSERIALIZE ( CONTENT '<foo>bar</foo>'::xml AS text )

145 Data Type Array. Length is not enforced: PostgreSQL does not currently enforce the declared size, so int[] and int[10] both accept any number of elements.
array_length(ARRAY[[1,2,3,4,5],[6,7,8,9,10]], 1)
Dimensions are not enforced either: int[] and int[][] currently behave the same and can store arrays of any dimensionality. Matrix shape is enforced: in a multidimensional array, every slice of the same dimension must have the same number of elements. Correct: array[[1,2,3,4],[5,6,7,8]]. Incorrect: array[[1,2,3,4],[5,6,7]]. Element types are enforced: all elements must have the same type. Correct: array[1,2,3]. Incorrect: array[1,2,'abc'].

146 Data Type Array extensions. One-dimensional arrays support prepend, append, and cat operations:
array_append(ARRAY['digoal','francs'],'david')
array_prepend('david',ARRAY['digoal','francs'])
Two-dimensional arrays support only cat:
array_cat(ARRAY[['digoal','zhou'],['francs','tan']], ARRAY['david','guo'])
Subscripts start at 1 by default, but other bounds can be specified:
array_lower(ARRAY[[1,2,3,4,5],[6,7,8,9,10]], 2)
array_lower('[-3:-2]={1,2}'::int[], 1)
select array_upper('[-3:-2]={1,2}'::int[], 1)

147 Data Type Array slice: array_dims(ARRAY[[1,2,3,4,5],[6,7,8,9,10]])
In the first [], 1 is the lower subscript and 2 the upper one; in the second [], the left 1 is the lower subscript and the right value the upper one.
a[2:3][1:2] = {{3,4},{5,6}}
Alternative slice spelling: as soon as one dimension uses slice syntax, any dimension written without it is treated as the upper bound by default, so a[2:3][2] is the same as a[2:3][1:2]. See "PostgreSQL ARRAY datatype introduce".

148 Data Type Array functions and operators

149 Data Type Composite Type: user-defined
create type test as (info text,id int,crt_time timestamp(0));
Creating a table also creates a composite type of the same name by default, so a table and a user-defined type cannot share a name:
create table test (id int primary key,info text);
ERROR:  relation "test" already exists
Example:
CREATE TYPE inventory_item AS ( name text, supplier_id integer, price numeric );
CREATE TABLE on_hand ( item inventory_item, count integer );

150 In SET and INTO clauses a composite column must not be referenced with parentheses; in all other clauses it may be.
Data Type Composite Type
INSERT INTO on_hand VALUES (ROW('fuzzy dice', 42, 1.99), 1000);
SELECT (on_hand.item).name FROM on_hand WHERE (on_hand.item).price < 10;
    name
 fuzzy dice
UPDATE on_hand SET item = ROW('fuzzy dice',10,100) WHERE count=1000;
UPDATE on_hand SET item.price = (on_hand.item).price WHERE (on_hand.item).name='fuzzy dice';
INSERT INTO on_hand (item.name, item.supplier_id) VALUES('test', 2.2);
postgres=# select * from on_hand;
         item          | count
 ("fuzzy dice",10,200) |  1000
 (test,2,)             |

151 Data Type oid (object identifier) 4 bytes
xid (transaction identifier) 4 bytes: xmin, xmax. cid (command identifier) 4 bytes: cmin, cmax. tid (tuple identifier) 6 bytes: ctid. The alias types below correspond to the oid columns of the system catalogs and are all of type oid; lookups can be schema-qualified or follow the default search_path order.

152 Data Type test=# create sequence seq_test start with 1;
test=# select 'seq_test'::regclass;
 regclass
 seq_test
test=# select 'seq_test'::regclass::oid;
  oid
 49247
test=# select 'sum(int4)'::regprocedure;
 regprocedure
 sum(integer)
test=# select 'sum(int4)'::regprocedure::oid;
 2108

153 Data Type Pseudo-Types

154 Functions and Operators
Selected excerpts; see the manual for the full list.

155 Functions and Operators
Logical, sequence, comparison, conditional expressions, arithmetic, array, string, set, bytea, window, bit, subquery expressions, regular expressions, row and array comparisons, formatting output, set-returning functions, date/time, trigger functions, enum, geometric, network address, full text search, XML

156 Functions and Operators
Logical operators: AND OR NOT

157 Functions and Operators
Comparison:
a BETWEEN x AND y       is equivalent to  a >= x AND a <= y
a NOT BETWEEN x AND y   is equivalent to  a < x OR a > y
IS [NOT] NULL
expression IS TRUE / IS NOT TRUE / IS FALSE / IS NOT FALSE / IS UNKNOWN / IS NOT UNKNOWN
test=# select 1 where null is distinct from null;
(0 rows)
test=# select 1 where null is not distinct from null;
 ?column?
        1
(1 row)
test=# select 1 where null = null;
(0 rows)
test=# select 1 where null <> null;
(0 rows)

158 Functions and Operators
Mathematical functions and operators: math functions, trigonometric functions

159 Functions and Operators
String functions and operators

160 Functions and Operators
bytea functions and operators

161 Functions and Operators
Bit string functions and operators

162 Functions and Operators
Pattern matching: LIKE, SIMILAR TO regular expressions, POSIX regular expressions

163 Functions and Operators
Formatting functions: date/time patterns, numeric patterns

164 Functions and Operators
Date/time functions and operators. Operators; functions such as extract, date_part, date_trunc, pg_sleep. Make sure that your session does not hold more locks than necessary when calling pg_sleep. Otherwise other sessions might have to wait for your sleeping process, slowing down the entire system.

165 Functions and Operators
Date/time functions and operators

166 Functions and Operators
Date/time functions and operators

167 Functions and Operators
Enum support functions and operators

168 Functions and Operators
Set functions and operators

169 Functions and Operators
Set functions and operators; type conversion functions

170 Functions and Operators
Full text search functions and operators

171 Functions and Operators
Sequence functions and operators

172 Functions and Operators
Conditional expressions: CASE; COALESCE (the COALESCE function returns the first of its arguments that is not null; null is returned only if all arguments are null; it is often used to substitute a default value for null values when data is retrieved for display); NULLIF (the NULLIF function returns a null value if value1 equals value2, otherwise it returns value1); GREATEST and LEAST
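A quick sketch of the three helpers:
postgres=# SELECT COALESCE(NULL, NULL, 'default'), NULLIF('a', 'a'), GREATEST(1, 2, 3);
 coalesce | nullif | greatest
 default  |        |        3
(1 row)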

173 Functions and Operators
Array functions and operators: operators

174 Functions and Operators
Array functions and operators: functions

175 Functions and Operators
Aggregate functions: many input values, one output value

176 Functions and Operators
Window functions; see the examples in an earlier section

177 Functions and Operators
Subquery expressions:
EXISTS (subquery)
expression [NOT] IN (subquery)
row_constructor [NOT] IN (subquery)
expression operator ANY | SOME (subquery)        -- IN is equivalent to = ANY
row_constructor operator ANY | SOME (subquery)
expression operator ALL (subquery)               -- NOT IN is equivalent to <> ALL
row_constructor operator ALL (subquery)
row_constructor operator (subquery)

178 Functions and Operators
Comparing an expression with an ARRAY: expression operator ANY | SOME (array expression); expression operator ALL (array expression)

179 Functions and Operators
Set-returning functions (functions returning multiple rows)

180 Functions and Operators
System information functions: session information

181 Functions and Operators
System information functions: access privilege inquiry functions

182 Functions and Operators
System information functions: schema visibility inquiry functions

183 Functions and Operators
System information functions: system catalog information functions

184 Functions and Operators
System information functions: system catalog information functions; comment information functions

185 Functions and Operators
System information functions: transaction ID and snapshot functions; the txid_snapshot structure

186 Functions and Operators
System administration functions: configuration settings functions; server signaling functions

187 Functions and Operators
System administration functions: backup control functions; recovery information functions

188 Functions and Operators
System administration functions: recovery control functions; database object size functions

189 Functions and Operators
System administration functions: database object location functions; generic file access functions

190 Functions and Operators
System administration functions: advisory lock functions

191 Functions and Operators
System administration functions: advisory lock functions. Ordinary database locks (implicit; not controllable by the application): unsuitable for long transactions, they reduce concurrency on the locked rows and keep dead tuples from being reclaimed. Advisory locks (explicit; application-controlled): for special scenarios, e.g. when a lock must be held for a long time without hurting concurrency. Example application:
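A minimal sketch (the key value 10 is arbitrary; session-level advisory locks are released at disconnect or by an explicit unlock):
postgres=# SELECT pg_try_advisory_lock(10);
 pg_try_advisory_lock
 t
postgres=# SELECT pg_advisory_unlock(10);
 pg_advisory_unlock
 t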

192 Functions and Operators
Trigger functions
test=# \sf suppress_redundant_updates_trigger
CREATE OR REPLACE FUNCTION pg_catalog.suppress_redundant_updates_trigger()
 RETURNS trigger
 LANGUAGE internal
 STRICT
AS $function$suppress_redundant_updates_trigger$function$
What is this function for? With it, an UPDATE that changes no data does not create a new row version.
test=# create table test (id int);
test=# insert into test values (1),(2),(3);
test=# select ctid,* from test where id=1;
 ctid  | id
 (0,1) |  1

193 Functions and Operators
Trigger functions
test=# update test set id=1 where id=1;
test=# select ctid,* from test where id=1;
 ctid  | id
 (0,4) |  1
CREATE TRIGGER z_min_update BEFORE UPDATE ON test FOR EACH ROW EXECUTE PROCEDURE suppress_redundant_updates_trigger();
UPDATE 0
With the trigger in place, the redundant update is suppressed and no new row version is created.

194 Type Conversion The PostgreSQL scanner/parser divides lexical elements into five fundamental categories: integers, non-integer numbers, strings, identifiers, and key words. Explicit typing can improve performance. Four kinds of SQL constructs require values of specific types. Function: PostgreSQL distinguishes functions not by name alone but by name plus argument types; one schema may hold several functions with the same name but different numbers or types of arguments. Operator: operators may likewise share a name within a schema as long as their operand types differ. Value Storage: INSERT, UPDATE. UNION, CASE, ARRAY...: when merging (UNION) or selecting (CASE) among multiple values, every column's output type must agree, and ARRAY elements must share one type. Type casts: CAST(value AS target_type) or value::target_type. There should be no extra overhead in the parser or executor if a query does not need implicit type conversion. That is, if a query is well-formed and the types already match, then the query should execute without spending extra time in the parser and without introducing unnecessary implicit conversion calls in the query.

195 Index Planner Methods #enable_bitmapscan = on #enable_hashagg = on
#enable_hashjoin = on  #enable_indexscan = on  #enable_material = on  #enable_mergejoin = on  #enable_nestloop = on  #enable_seqscan = on  #enable_sort = on  #enable_tidscan = on
What are indexes mainly for? Speeding up tuple lookup; enforcing primary key and unique constraints; ordering.

196 Index Index Type OR Access Method b-tree hash gist gin
spgist (PostgreSQL 9.2)
B-tree operators: >, <, >=, <=, =, BETWEEN, IN, IS NULL, IS NOT NULL, LIKE (anchored prefix match), ILIKE (prefix match on leading characters whose upper and lower case forms are identical)
Hash operator: =
Hash indexes are not WAL-logged: after a database crash, a hash index must be rebuilt if its table was modified during recovery, and hash indexes are not propagated to WAL-based or streaming-replication standbys.
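A sketch of an anchored-prefix scan (table and index names are illustrative; the text_pattern_ops opclass is needed unless the database uses the C locale):
postgres=# create table liketest (info text);
postgres=# create index idx_liketest_info on liketest (info text_pattern_ops);
postgres=# explain select * from liketest where info like 'digoal%';
-- the anchored pattern can use the b-tree; a pattern like '%digoal' cannot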

197 Index GiST operators: <<, &<, &>, >>, <<|, &<|, |&>, ~=, &&, plus nearest-neighbor search. GIN operators: <@, @>, =, &&. CREATE INDEX syntax: note the method, collation, and opclass. CREATE OPERATOR CLASS syntax: query the system catalogs to see which operator classes are available, which operators and functions each opclass defines, and whether they are used for search or for ordering: pg_am, pg_amop, pg_amproc, pg_operator, pg_opclass, pg_opfamily

198 Index Example of using an index with the <-> operator to quickly retrieve the points nearest a given point, assuming a GiST index on location:
SELECT * FROM places ORDER BY location <-> point '(101,456)' LIMIT 10;

199 Index What determines whether an index is used? Beyond the cost-based optimizer's choice, the plan must also respect the current planner settings.
First comes the access method mentioned earlier, then the operator class in use and the operators or functions the opclass defines. Once all of these match, the cost-based optimizer still makes the final choice.
#enable_bitmapscan = on    #seq_page_cost = 1.0
#enable_hashagg = on       #random_page_cost = 4.0
#enable_hashjoin = on      #cpu_tuple_cost = 0.01
#enable_indexscan = on     #cpu_index_tuple_cost = 0.005
#enable_material = on      #cpu_operator_cost = 0.0025
#enable_mergejoin = on     #effective_cache_size = 128MB
#enable_nestloop = on
#enable_seqscan = on
#enable_sort = on
#enable_tidscan = on

200 Index Multicolumn Index
only the B-tree, GiST and GIN index types support multicolumn indexes. Up to 32 columns can be specified. (This limit can be altered when building PostgreSQL; see the file pg_config_manual.h.) B-tree Query conditions that involve any subset of the index's columns, but the index is most efficient when there are constraints on the leading (leftmost) columns. GiST Query conditions that involve any subset of the index's columns. Conditions on additional columns restrict the entries returned by the index, but the condition on the first column is the most important one for determining how much of the index needs to be scanned. A GiST index will be relatively ineffective if its first column has only a few distinct values, even if there are many distinct values in additional columns. GIN A multicolumn GIN index can be used with query conditions that involve any subset of the index's columns. Unlike B-tree or GiST, index search effectiveness is the same regardless of which index column(s) the query conditions use.

201 Index Multicolumn Index
Multicolumn indexes: any indexed column can appear in the condition as long as the operator or function matches the opclass; whether part or all of the index is scanned depends largely on whether the leading column is among the conditions. Example:
postgres=# create table test (c1 int,c2 int);
postgres=# insert into test select 1,generate_series(1,100000);
postgres=# create index idx_test_1 on test(c1,c2);
postgres=# analyze test;
postgres=# explain select * from test where c2=100;
 Seq Scan on test (cost= rows=1 width=8)
   Filter: (c2 = 100)
postgres=# set enable_seqscan=off;
postgres=# explain analyze select * from test where c2=100;
 Index Scan using idx_test_1 on test (cost= rows=1 width=8) (actual time= rows=1 loops=1)
   Index Cond: (c2 = 100)
Note that the filter column is not the leading column, so at first glance the index looks unusable, yet it can still be scanned.

202 Index Using an index for ordering saves the CPU cost of a sort, especially when only a few rows are returned; the speedup can be large. Example:
postgres=# create table test (id int,info text);
postgres=# insert into test select generate_series(1,100000),'digoal'||generate_series(1,100000);
postgres=# explain analyze select * from test order by id limit 10;
 Limit (cost= rows=10 width=36) (actual time= rows=10 loops=1)
   -> Sort (cost= rows= width=36) (actual time= rows=10 loops=1)
        Sort Key: id
        Sort Method: top-N heapsort  Memory: 25kB
        -> Seq Scan on test (cost= rows= width=36) (actual time= rows= loops=1)
 Total runtime: ms
postgres=# create index idx_test_id on test(id);
 Limit (cost= rows=10 width=36) (actual time= rows=10 loops=1)
   -> Index Scan using idx_test_id on test (cost= rows= width=36) (actual time= rows=10 loops=1)
 Total runtime: ms

203 Index Indexing the referencing column speeds up UPDATE and DELETE against the referenced table:
postgres=# create table t1 (id int primary key,info text);
postgres=# insert into t1 select generate_series(1,100000),'digoal'||generate_series(1,100000);
postgres=# create table t2 (id int references t1(id) on update cascade,info text);
postgres=# insert into t2 select generate_series(1,100000),'digoal'||generate_series(1,100000);
postgres=# explain analyze update t1 set id= where id=100000;
 Update on t1 (cost= rows=1 width=17) (actual time= rows=0 loops=1)
   -> Bitmap Heap Scan on t1 (cost= rows=1 width=17) (actual time= rows=1 loops=1)
        Recheck Cond: (id = )
        -> Bitmap Index Scan on t1_pkey (cost= rows=1 width=0) (actual time= rows=1 loops=1)
             Index Cond: (id = )
 Trigger for constraint t2_id_fkey on t1: time= calls=1
 Trigger for constraint t2_id_fkey on t2: time=0.094 calls=1
 Total runtime: ms

204 Index postgres=# create index idx_t2_id on t2(id);
postgres=# explain analyze update t1 set id= where id=100001;
 Update on t1 (cost= rows=1 width=17) (actual time= rows=0 loops=1)
   -> Bitmap Heap Scan on t1 (cost= rows=1 width=17) (actual time= rows=1 loops=1)
        Recheck Cond: (id = )
        -> Bitmap Index Scan on t1_pkey (cost= rows=1 width=0) (actual time= rows=1 loops=1)
             Index Cond: (id = )
 Trigger for constraint t2_id_fkey on t1: time=0.516 calls=1
 Trigger for constraint t2_id_fkey on t2: time=0.058 calls=1
 Total runtime: ms

205 Index Unique constraints and primary key columns get unique indexes automatically:
postgres=# create table test (id int primary key,info text unique);
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "test_pkey" for table "test"
NOTICE:  CREATE TABLE / UNIQUE will create implicit index "test_info_key" for table "test"
postgres=# \d test
 Table "public.test"
 Column | Type    | Modifiers
 id     | integer | not null
 info   | text    |
Indexes:
    "test_pkey" PRIMARY KEY, btree (id)
    "test_info_key" UNIQUE CONSTRAINT, btree (info)

206 Index Combining Multiple Indexes (src/backend/executor). For example:
Bitmap Index Scan (filter) -> bitmap of tuple addresses
Bitmap Index Scan (filter) -> bitmap of tuple addresses
... combined by BitmapAnd | BitmapOr -> Bitmap Heap Scan

207 Index Combining Multiple Indexes one index combining
postgres=# create table test (id int primary key,info text unique);
postgres=# insert into test select generate_series(1,100000),'digoal'||generate_series(1,100000);
postgres=# explain analyze select * from test where id=1 or id=1000;
 Bitmap Heap Scan on test (cost= rows=2 width=36) (actual time= rows=2 loops=1)
   Recheck Cond: ((id = 1) OR (id = 1000))
   -> BitmapOr (cost= rows=2 width=0) (actual time= rows=0 loops=1)
        -> Bitmap Index Scan on test_pkey (cost= rows=1 width=0) (actual time= rows=1 loops=1)
             Index Cond: (id = 1)
        -> Bitmap Index Scan on test_pkey (cost= rows=1 width=0) (actual time= rows=1 loops=1)
             Index Cond: (id = 1000)

208 Index multiple index combining
postgres=# explain analyze select * from test where id=1 or info='digoal1000';
 Bitmap Heap Scan on test (cost= rows=2 width=15) (actual time= rows=2 loops=1)
   Recheck Cond: ((id = 1) OR (info = 'digoal1000'::text))
   -> BitmapOr (cost= rows=2 width=0) (actual time= rows=0 loops=1)
        -> Bitmap Index Scan on test_pkey (cost= rows=1 width=0) (actual time= rows=1 loops=1)
             Index Cond: (id = 1)
        -> Bitmap Index Scan on test_info_key (cost= rows=1 width=0) (actual time= rows=1 loops=1)
             Index Cond: (info = 'digoal1000'::text)
 Total runtime: ms

209 Index Collation example: CREATE TABLE test1c ( id integer,
content varchar COLLATE "x" );
CREATE INDEX test1c_content_index ON test1c (content);
SELECT * FROM test1c WHERE content > constant;
-- the following SQL cannot use the index test1c_content_index
SELECT * FROM test1c WHERE content > constant COLLATE "y";
-- an index matching COLLATE "y" must be built for the SQL above to use an index:
CREATE INDEX test1c_content_y_index ON test1c (content COLLATE "y");

210 Index Partial index example
-- a partial predicate: exclude the common value id=1, which occurs in 100,000 rows; an index scan for it makes no sense, and a partial index avoids the situation nicely.
postgres=# create table test(id int,info text);
postgres=# insert into test select 1,'digoal'||generate_series(1,100000);
postgres=# insert into test select generate_series(1,1000),'digoal'||generate_series(1,1000);
postgres=# create index idx_test_1 on test(id) where id<>1;
postgres=# explain select * from test where id=1;
 Seq Scan on test (cost= rows= width=15)
   Filter: (id = 1)
postgres=# explain select * from test where id=100;
 Index Scan using idx_test_1 on test (cost= rows=1 width=15)
   Index Cond: (id = 100)

211 Index -- using a column that appears only in the partial-index predicate
postgres=# explain select * from test where info='digoal' and id=1;
 QUERY PLAN
 Seq Scan on test (cost= rows=1 width=15)
   Filter: ((info = 'digoal'::text) AND (id = 1))
postgres=# create index idx_test_2 on test(id) where info='digoal100';
postgres=# explain select * from test where info='digoal100';
 Index Scan using idx_test_2 on test (cost= rows=1 width=15)
(1 row)

212 Index -- why exclude the common value
postgres=# drop index idx_test_1;
postgres=# explain select * from test where id=1;
 QUERY PLAN
 Seq Scan on test (cost= rows= width=15)
   Filter: (id = 1)
-- why a sequential scan is chosen:
postgres=# select id,count(*) from test group by id order by count(*) desc limit 10;
 id | count
  1 | ...
(the value 1 dominates the table, so a sequential scan is cheapest for it)

213 Index Function indexes and expression indexes. Expression index:
postgres=# explain select * from test where id+1=100;
 QUERY PLAN
 Seq Scan on test (cost= rows=505 width=15)
   Filter: ((id + 1) = 100)
postgres=# create index idx_test_1 on test((id+1));
CREATE INDEX
 Bitmap Heap Scan on test (cost= rows=505 width=15)
   Recheck Cond: ((id + 1) = 100)
   -> Bitmap Index Scan on idx_test_1 (cost= rows=505 width=0)
        Index Cond: ((id + 1) = 100)

214 Index Function index -- the following case-insensitive search cannot use an ordinary index:
postgres=# create table test (id int,info text,crt_time timestamp(0));
postgres=# insert into test select generate_series(1,100000),'digoal'||generate_series(1,100000),clock_timestamp();
postgres=# create index idx_test_info on test(info);
postgres=# explain select * from test where info ~* '^a';
 Seq Scan on test (cost= rows=10 width=23)
   Filter: (info ~* '^a'::text)
-- for case-insensitive ilike and ~* to use an index, the leading characters must be ones whose upper and lower case forms coincide: letters do not qualify, digits do. The letter 'a' is case-sensitive while the digit '0' is not, so the index entries differ.
postgres=# explain select * from test where info ~* '^0';
 Index Scan using idx_test_info on test (cost= rows=10 width=23)
   Index Cond: ((info >= '0'::text) AND (info < '1'::text))
   Filter: (info ~* '^0'::text)

215 Index Function index -- how can letters use a case-insensitive index too? A function index; the function must be IMMUTABLE,
and the filter must use the same expression the index was created with:
postgres=# select proname,provolatile from pg_proc where proname='lower';
 proname | provolatile
 lower   | i
postgres=# create index idx_test_info_1 on test(lower(info));
CREATE INDEX
postgres=# explain select * from test where lower(info) ~ '^a';
 Bitmap Heap Scan on test (cost= rows=500 width=23)
   Filter: (lower(info) ~ '^a'::text)
   -> Bitmap Index Scan on idx_test_info_1 (cost= rows=500 width=0)
        Index Cond: ((lower(info) >= 'a'::text) AND (lower(info) < 'b'::text))
(4 rows)

216 Index Functions, constants, variables, or subqueries as filter values
The optimizer must know the value passed to the operator in order to use the table's histogram in pg_statistic to decide whether an index scan, a full table scan, or some other plan is cheapest; when a variable is passed in, it usually falls back to the default generic plan.
postgres=# create index idx_test_1 on test (crt_time);
postgres=# select proname,proargtypes,provolatile from pg_proc where prorettype in (1114,1184) order by proargtypes;
 proname                              | proargtypes | provolatile
 transaction_timestamp                |             | s
 statement_timestamp                  |             | s
 pg_stat_get_bgwriter_stat_reset_time |             | s
 pg_conf_load_time                    |             | s
 pg_postmaster_start_time             |             | s
 pg_last_xact_replay_timestamp        |             | v

217 Index Functions, constants, variables, or subqueries as filter values
 clock_timestamp |             | v
 now             |             | s
postgres=# explain select * from test where crt_time = clock_timestamp();
 Seq Scan on test (cost= rows= width=23)
   Filter: (crt_time = clock_timestamp())
postgres=# explain select * from test where crt_time = now();
 Index Scan using idx_test_1 on test (cost= rows=1 width=23)
   Index Cond: (crt_time = now())
postgres=# alter function now() strict volatile;
 Filter: (crt_time = now())

218 Index Functions, constants, variables, or subqueries as filter values
postgres=# alter function clock_timestamp() strict immutable;
ALTER FUNCTION
postgres=# explain select * from test where crt_time = clock_timestamp();
 QUERY PLAN
 Index Scan using idx_test_1 on test (cost= rows=1 width=23)
   Index Cond: (crt_time = ' :32: '::timestamp with time zone)
When a function appears as a filter, immutable and stable functions are evaluated before the optimizer starts costing plans. A volatile function runs only when the SQL executes, so its value is unavailable while the plan is being chosen and cannot be compared against pg_statistic to decide between an index scan, a full table scan, or another plan.

219 Index For the same reason, an expression used as a filter is not evaluated while the optimizer computes the plan, so the optimal plan may be missed:
postgres=# explain select * from test where crt_time = (select now());
 QUERY PLAN
 Seq Scan on test (cost= rows= width=23)
   Filter: (crt_time = $0)
   InitPlan 1 (returns $0)
     -> Result (cost= rows=1 width=0)
(4 rows)
Whether bind variables use an index depends on the driver; the mechanism is similar to the above. (Usually a session's first parse is a hard parse and later ones are soft parses; the plan chosen during the hard parse determines all the later plans.)

220 Index Indexes add overhead to data modification. One way to reduce the overhead: HOT.
Set fillfactor < 100 when creating the table to reserve space for HOT (see the sketch below). Before studying HOT, first review how the database stores data and the data structures involved.
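A minimal sketch (the table name is illustrative): reserve 10% of every heap page for HOT updates.
postgres=# create table hot_demo (id int primary key, info text) with (fillfactor = 90);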

221 DataFile Storage Layout
One data file (or set of files) per table or index. BlockID: assigned sequentially, 0 to 0xFFFFFFFE; the file is a sequence of initialized blocks numbered 0x00000000 through 0xFFFFFFFE.

222 ONE PAGE Page Layout PageHeaderData(24 Bytes)
ItemIdData: array of (offset,flag,length) pairs pointing to the actual items; 4 bytes per item.
Free space: the unallocated space. New item pointers are allocated from the start of this area, new items from the end.
Items: the actual items themselves.
Special space: index access method specific data; different methods store different data; empty in ordinary tables. (An access method should always initialize its pages with PageInit and then set its own opaque fields.)

223 PageHeader Layout
 Field               | Type          | Length  | Description
 pd_lsn              | XLogRecPtr    | 8 bytes | LSN: next byte after last byte of xlog record for last change to this page
 pd_tli              | uint16        | 2 bytes | TimeLineID of last change (only its lowest 16 bits)
 pd_flags            | uint16        | 2 bytes | Flag bits
 pd_lower            | LocationIndex | 2 bytes | Offset to start of free space
 pd_upper            | LocationIndex | 2 bytes | Offset to end of free space
 pd_special          | LocationIndex | 2 bytes | Offset to start of special space
 pd_pagesize_version | uint16        | 2 bytes | Page size and layout version number information
 pd_prune_xid        | TransactionId | 4 bytes | Oldest unpruned XMAX on page, or zero if none

224 pd_flags define
/*
 * pd_flags contains the following flag bits.  Undefined bits are initialized
 * to zero and may be used in the future.
 *
 * PD_HAS_FREE_LINES is set if there are any LP_UNUSED line pointers before
 * pd_lower.  This should be considered a hint rather than the truth, since
 * changes to it are not WAL-logged.
 *
 * PD_PAGE_FULL is set if an UPDATE doesn't find enough free space in the
 * page for its new tuple version; this suggests that a prune is needed.
 * Again, this is just a hint.
 */
#define PD_HAS_FREE_LINES   0x0001  /* are there any unused line pointers? */
#define PD_PAGE_FULL        0x0002  /* not enough free space for new tuple? */
#define PD_ALL_VISIBLE      0x0004  /* all tuples on page are visible to everyone */
#define PD_VALID_FLAG_BITS  0x0007  /* OR of all valid pd_flags bits */

225 ItemIdData Layout
/* An item pointer (also called line pointer) on a buffer page.
 * In some cases an item pointer is "in use" but does not have any associated
 * storage on the page.  By convention, lp_len == 0 in every item pointer
 * that does not have storage, independently of its lp_flags state. */
typedef struct ItemIdData
{
    unsigned    lp_off:15,      /* offset to tuple (from start of page) */
                lp_flags:2,     /* state of item pointer, see below */
                lp_len:15;      /* byte length of tuple */
} ItemIdData;
/* lp_flags has these possible states.  An UNUSED line pointer is available
 * for immediate re-use, the other states are not. */
#define LP_UNUSED   0   /* unused (should always have lp_len=0) */
#define LP_NORMAL   1   /* used (should always have lp_len>0) */
#define LP_REDIRECT 2   /* HOT redirect (should have lp_len=0) */
#define LP_DEAD     3   /* dead, may or may not have storage */

226 Tuple Layout
One tuple:
HeapTupleHeaderData (23 bytes on most machines)
null bitmap (optional; present only if the HEAP_HASNULL bit is set in t_infomask; occupies enough bytes to have one bit per data column; a 1 bit indicates not-null, a 0 bit indicates null)
padding (optional; any padding needed to make t_hoff a MAXALIGN multiple appears between the null bitmap and the object ID)
object ID (optional; present only if the HEAP_HASOID bit is set in t_infomask)
ColumnData

227 HeapTupleHeader Layout
 Field       | Type            | Length  | Description
 t_xmin      | TransactionId   | 4 bytes | insert XID stamp
 t_xmax      | TransactionId   | 4 bytes | delete XID stamp
 t_cid       | CommandId       | 4 bytes | insert and/or delete CID stamp (overlays with t_xvac)
 t_xvac      | TransactionId   | 4 bytes | XID for VACUUM operation moving a row version
 t_ctid      | ItemPointerData | 6 bytes | current TID of this or newer row version
 t_infomask2 | int16           | 2 bytes | number of attributes, plus various flag bits
 t_infomask  | uint16          | 2 bytes | various flag bits
 t_hoff      | uint8           | 1 byte  | offset to user data

228 INDEX Pointer Introduction
ItemPointer (in the index) -> ItemId structure (array of (lp_off:15bit, lp_flags:2bit, lp_len:15bit) entries pointing to the actual items; 4 bytes per ItemId) -> Item (the tuple)

229 HOT Introduction Heap-Only Tuple Benefit :
eliminates redundant index entries and
allows the re-use of space taken by DELETEd or obsoleted UPDATEd tuples without performing a table-wide vacuum.
Example (index entry vs. line pointer chain across successive updates):
Update 1: index points to 1; line pointers [1] [2]; items [ ]->[ ]
Update 2: index points to 1; line pointer [1]->[2]; items [ ]
Update 3: index points to 1; line pointers [1]->[2] [3]; items [ ]->[ ]
Update 4: index points to 1; line pointers [1]------>[3]; items [ ]

230 HOT Update
[Diagram: a heap page and an index page. The index ItemPoint points at LinePoint1 -> Tuple1 (t_ctid set, HEAP_HOT_UPDATED), which chains to LinePoint2 -> Tuple2 (HEAP_HOT_TUPLE); the rest of the heap page is free space]

231 HOT Update
[Diagram: after pruning, LinePoint1 becomes LP_REDIRECT to LinePoint2 -> Tuple2 (HEAP_HOT_TUPLE); Tuple1 is dead and vacuumed; the index ItemPoint still points at LinePoint1]

232 HOT Update Observing HOT with the pageinspect extension
postgres=# create extension pageinspect;
postgres=# create table hot_test (id int primary key,info text);
postgres=# insert into hot_test values (1,'digoal');
-- rows are inserted starting at page 0, so we skip looking up the ctid and inspect page 0 directly
-- current page header
postgres=# select * from page_header(get_raw_page('hot_test',0));
 lsn        | tli | flags | lower | upper | special | pagesize | version | prune_xid
 2/75B27878 |   1 |       |       |       |         |          |         |
-- current items
postgres=# select * from heap_page_items(get_raw_page('hot_test',0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid
  1 |        |          |        |        |        |          | (0,1)  |             |            |        |        |

233 HOT Update -- current index page header
postgres=# select * from page_header(get_raw_page('hot_test_pkey',0));
 lsn        | tli | flags | lower | upper | special | pagesize | version | prune_xid
 2/75B278B0 |   1 |       |       |       |         |          |         |
-- current index items
postgres=# select * from heap_page_items(get_raw_page('hot_test_pkey',0));
 lp | lp_off | lp_flags | lp_len | ...
  1 |
  2 |
  3 |
  4 |
  5 |
  6 |

234 HOT Update The ItemId and tuple contents here correspond to the first diagram.
-- after one update
postgres=# update hot_test set info='new' where id=1;
-- heap items
postgres=# select * from heap_page_items(get_raw_page('hot_test',0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | ...
  1 |        |          |        |        |        |          | (0,2)  |
  2 |        |          |        |        |        |          | (0,2)  |
-- index items (unchanged)
postgres=# select * from heap_page_items(get_raw_page('hot_test_pkey',0));
-- output omitted

235 HOT Update The ItemId contents now correspond to the second diagram.
-- after a vacuum
postgres=# vacuum hot_test ;
postgres=# select * from heap_page_items(get_raw_page('hot_test',0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | ...
  1 |        |          |        |        |        |          |        |
  2 |        |          |        |        |        |          | (0,2)  |
-- after several more updates
postgres=# update hot_test set info='new' where id=1;

236 HOT Update Note: after the redirect, lp_off holds the number of the ItemId it redirects to, not a byte offset.
postgres=# select * from heap_page_items(get_raw_page('hot_test',0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | ...
  1 |        |          |        |        |        |          |        |
  2 |        |          |        |        |        |          | (0,3)  |
  3 |        |          |        |        |        |          | (0,4)  |
  4 |        |          |        |        |        |          | (0,5)  |
  5 |        |          |        |        |        |          | (0,6)  |
  6 |        |          |        |        |        |          | (0,7)  |
  7 |        |          |        |        |        |          | (0,8)  |
  8 |        |          |        |        |        |          | (0,8)  |

237 HOT Update -- after another vacuum
postgres=# vacuum hot_test ;
postgres=# select * from heap_page_items(get_raw_page('hot_test',0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | ...
  1 |
  2 |
  3 |
  4 |
  5 |
  6 |
  7 |
  8 |        |          |        |        |        |          | (0,8)  |
See "Use pageinspect EXTENSION view PostgreSQL Page's raw infomation".

238 Index Why create indexes with CONCURRENTLY?
Is an index scan faster than a full table scan? It depends on the cache size, the storage's IOPS capability, whether the index supplies the ordering, and how much IO the SQL must issue (index: random IO; full table scan: sequential IO). Is my index actually used, or is it dead weight? Check with explain and the statistics views (see the sketch below):
pg_statio_all_indexes: for each index in the current database, the table and index OID, schema, table and index name, numbers of disk blocks read and buffer hits in that index.
pg_stat_all_indexes: for each index in the current database, the table and index OID, schema, table and index name, number of index scans initiated on that index, number of index entries returned by index scans, and number of live table rows fetched by simple index scans using that index.
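A minimal sketch that lists indexes never scanned since the statistics were last reset (candidates for dropping):
postgres=# select schemaname, relname, indexrelname, idx_scan
postgres-# from pg_stat_user_indexes
postgres-# where idx_scan = 0
postgres-# order by relname;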

239 Full Text Search Skipped in this training. External full text search engines such as Sphinx also support PostgreSQL.

240 Concurrency Control Everyone Must Know
Business Rules & PostgreSQL ISOLATION LEVEL

241 Concurrency Control How the SQL-standard isolation levels map to read protections
Note the SQL standard wording: the minimum protections each isolation level must provide. PostgreSQL's implemented mapping provides more, including a real Serializable; it additionally detects whether concurrent serializable transactions can be serialized.
 Isolation Level  | Dirty Read                            | Nonrepeatable Read | Phantom Read
 Read uncommitted | Possible (Not possible in PostgreSQL) | Possible           | Possible
 Read committed   | Not possible                          | Possible           | Possible
 Repeatable read  | Not possible                          | Not possible       | Possible (Not possible in PostgreSQL)
 Serializable     | Not possible                          | Not possible       | Not possible

242 Concurrency Control dirty read
A transaction reads data written by a concurrent uncommitted transaction.
nonrepeatable read
A transaction re-reads data it has previously read and finds that data has been modified by another transaction (that committed since the initial read). That is, the same query run twice in one transaction returns different results, because another transaction modified the data and committed in between the two queries.
phantom read
A transaction re-executes a query returning a set of rows that satisfy a search condition and finds that the set of rows satisfying the condition has changed due to another recently-committed transaction. That is, the same range query run twice in one transaction returns different row sets; the second run may see more rows, e.g. because another transaction inserted rows inside the queried range and committed in between.

243 Concurrency Control Why PostgreSQL's isolation levels exceed the SQL standard's minimum read protections
PostgreSQL's concurrency control mechanism is MVCC: xmin, xmax, xid. INSERT: xmin = current xid. DELETE: xmax = current xid. UPDATE: old tuple xmax = current xid, new tuple xmin = current xid. MVCC therefore makes the different isolation levels easy to implement. Roughly (the internals are much more complex): read committed compares against the most recently assigned transaction ID, excluding uncommitted transaction IDs, and sees all rows up to and including that ID; repeatable read records which transactions were uncommitted when the transaction started, and every SQL in the transaction compares against the transaction ID at transaction start, excluding those, seeing all rows up to and including that ID.
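A minimal sketch of inspecting the current snapshot (the value shown reuses the example values from the sessions below):
postgres=# select txid_current_snapshot();
 txid_current_snapshot
 3910:3914:3910
(1 row)
-- xmin 3910, xmax 3914, xip_list {3910}: transaction 3910 is still running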

244 Concurrency Control txid_current_snapshot()函数 Name Description xmin
 Name     | Description
 xmin     | Earliest transaction ID (txid) that is still active. All earlier transactions will either be committed and visible, or rolled back and dead.
 xmax     | First as-yet-unassigned txid. All txids greater than or equal to this are not yet started as of the time of the snapshot, and thus invisible.
 xip_list | Active txids at the time of the snapshot. The list includes only those active txids between xmin and xmax; there might be active txids higher than xmax. A txid that is xmin <= txid < xmax and not in this list was already completed at the time of the snapshot, and thus either visible or dead according to its commit status. The list does not include txids of subtransactions.

245 Concurrency Control Session A:
digoal=> create table tbl_user (id int,firstname varchar(64),lastname varchar(64),corp varchar(64),age int);
digoal=> insert into tbl_user values (1,'zhou','digoal','sky-mobi',27);
digoal=> select ctid,xmin,xmax,cmin,cmax,* from tbl_user;
 ctid  | xmin | xmax | cmin | cmax | id | firstname | lastname | corp     | age
 (0,1) | 3909 |    0 |    0 |    0 |  1 | zhou      | digoal   | sky-mobi |  27
Session B:

246 Concurrency Control Session A:
digoal=> begin;
digoal=> update tbl_user set id=2 where id=1;
digoal=> select ctid,xmin,xmax,cmin,cmax,* from tbl_user;
 ctid  | xmin | xmax | cmin | cmax | id | firstname | lastname | corp     | age
 (0,2) | 3910 |    0 |    0 |    0 |  2 | zhou      | digoal   | sky-mobi |  27
digoal=> select txid_current_snapshot();
 3910:3914:3910
Session B:
select ctid,xmin,xmax,cmin,cmax,* from tbl_user;
 (0,1) | 3909 | 3910 |    0 |    0 |  1 | zhou      | digoal   | sky-mobi |  27
 3910:3914:3910

247 Concurrency Control A read committed use case: BEGIN;
UPDATE accounts SET balance = balance + 100.00 WHERE acctnum = 12345;
UPDATE accounts SET balance = balance - 100.00 WHERE acctnum = 7534;
COMMIT;
Read committed is not suitable for scenarios like this:
-- assume website is a two-row table with website.hits equaling 9 and 10
UPDATE website SET hits = hits + 1;
-- 9 becomes 10 and 10 becomes 11; both rows are locked
-- run from another session:
DELETE FROM website WHERE hits = 10;
-- the other session cannot see uncommitted rows; the lock it waits on belongs to the old row version, while hits has in fact already been changed to 11, i.e. two versions of that row now exist
-- the old version has both xmin and xmax set; the new one has xmin set and xmax=0; the other session waits for the lock on the old version, the one with both xmin and xmax set
-- when transaction 1 commits, the other transaction acquires the lock, but the row it then sees has hits = 11, so the DELETE ends up deleting 0 rows

248 Concurrency Control Repeatable Read Isolation Level
After the transaction begins, the same query returns the same result no matter how many times it runs. However, a repeatable read transaction cannot modify rows that another transaction modified and committed while the repeatable read transaction was running; doing so throws
ERROR:  could not serialize access due to concurrent update
For example:
postgres=# insert into test values (1,'digoal1'),(100,'digoal100'); commit;
SESSION A:
postgres=# begin transaction isolation level repeatable read;
postgres=# select * from test where id=1;
 id | info
  1 | digoal1

249 Concurrency Control SESSION B:
postgres=# begin;
postgres=# update test set info='new_digoal' where id=1;
postgres=# commit;
SESSION A:
postgres=# select * from test where id=1;
 id | info
  1 | digoal1
postgres=# select count(*) from test;
 count
     2

250 Concurrency Control SESSION B:
postgres=# begin;
postgres=# insert into test select generate_series(1000,1100),'digoal';
INSERT 0 101
postgres=# end;
-- in other databases (e.g. Oracle), this requires the serializable isolation level.
SESSION A:
postgres=# select count(*) from test;
 count
     2
postgres=# update test set info='session a' where id=1;
ERROR:  could not serialize access due to concurrent update
ROLLBACK

251 Concurrency Control SERIALIZABLE READ
In fact, this isolation level works exactly the same as Repeatable Read except that it monitors for conditions which could make execution of a concurrent set of serializable transactions behave in a manner inconsistent with all possible serial (one at a time) executions of those transactions. This monitoring does not introduce any blocking beyond that present in repeatable read, but there is some overhead to the monitoring, and detection of the conditions which could cause a serialization anomaly will trigger a serialization failure. SIReadLock The particular locks acquired during execution of a query will depend on the plan used by the query, and multiple finer-grained locks (e.g., tuple locks) may be combined into fewer coarser-grained locks (e.g., page locks) during the course of the transaction to prevent exhaustion of the memory used to track the locks. A READ ONLY transaction may be able to release its SIRead locks before completion, if it detects that no conflicts can still occur which could lead to a serialization anomaly. In fact, READ ONLY transactions will often be able to establish that fact at startup and avoid taking any predicate locks. If you explicitly request a SERIALIZABLE READ ONLY DEFERRABLE transaction, it will block until it can establish this fact. (This is the only case where Serializable transactions block but Repeatable Read transactions don't.) On the other hand, SIRead locks often need to be kept past transaction commit, until overlapping read write transactions complete. If a set of concurrent serializable transactions all commit successfully, the outcome is the same regardless of the order in which they executed; otherwise some transaction must fail, and the application must be able to cope with such failures, e.g. by re-running the failed transaction and committing it again.

252 Concurrency Control SERIALIZABLE READ example
 class | value
-------+-------
     1 |    10
     1 |    20
     2 |   100
     2 |   200
SERIALIZABLE SESSION A:
SELECT SUM(value) FROM mytab WHERE class = 1;
-- inserts the result (30) as the value in a new row with class = 2
SERIALIZABLE SESSION B:
SELECT SUM(value) FROM mytab WHERE class = 2;
-- obtains the result 300, which it inserts in a new row with class = 1
COMMIT; COMMIT; -- the second commit fails, and vice versa: whichever transaction commits last is rolled back.
To implement the same guarantee at the repeatable read level you would need SELECT FOR UPDATE or FOR SHARE, which restricts concurrency and adds page scanning; the serializable isolation level is therefore the better fit here.

253 Concurrency Control SERIALIZABLE READ transaction conflict detection conditions:
1. Both transactions must run at the serializable isolation level. In other words, conflicts only arise between two serializable transactions, never between a serializable transaction and a read committed one, and so on. An example follows later.
2. The conflicting transactions must overlap on at least one table: one transaction read some rows while the other updated or deleted some or all of those rows, or both transactions read the same rows.
3. Each conflicting transaction must perform a write (one of insert, delete, update) on the overlapping tables, and each transaction's write must affect at least one row.
4. Each conflicting transaction must contain two or more statements (when the reads and writes have no pre-existing row overlap), or one session's DML (update, delete) must overlap rows touched by the other session.
When a conflict occurs, the transaction that commits first returns successfully; within the conflict set, every transaction that commits later is automatically rolled back with:
ERROR: could not serialize access due to read/write dependencies among transactions
DETAIL: Reason code: Canceled on identification as a pivot, during commit attempt.
HINT: The transaction might succeed if retried.

254 Concurrency Control SERIALIZABLE READ
The following case does not trigger the "could not serialize access due to read/write dependencies among transactions" error, so SERIALIZABLE cannot enforce this kind of business rule for you:
postgres=# create table t1 (class int, value int);
postgres=# create table t2 (class int, value int);
postgres=# insert into t1 values (1,10),(1,20),(2,100),(2,200);
SESSION A:
postgres=# begin transaction isolation level serializable;
postgres=# select sum(value) from t1 where class=1;
 sum
-----
  30
(1 row)
SESSION B:

255 Concurrency Control
postgres=# begin transaction isolation level serializable;
postgres=# select sum(value) from t1 where class=2;
 sum
-----
 300
(1 row)
SESSION A:
postgres=# insert into t1 values(2,30);
-- note that the next insert goes into a different table.
SESSION B:
postgres=# insert into t2 values (1,300);
postgres=# end;
Both succeed. -- One might wish the second had failed, since it too used records the other session changed; but its write landed in another table, so no dangerous dependency is detected.

256 Concurrency Control Notes and tuning suggestions for using serializable
Declare transactions as READ ONLY when possible.
Control the number of active connections, using a connection pool if needed. This is always an important performance consideration, but it can be particularly important in a busy system using Serializable transactions.
Don't put more into a single transaction than needed for integrity purposes.
Don't leave connections dangling "idle in transaction" longer than necessary.
Eliminate explicit locks, SELECT FOR UPDATE, and SELECT FOR SHARE where no longer needed due to the protections automatically provided by Serializable transactions.
When the system is forced to combine multiple page-level predicate locks into a single relation-level predicate lock because the predicate lock table is short of memory, an increase in the rate of serialization failures may occur. You can avoid this by increasing max_pred_locks_per_transaction.
A sequential scan will always necessitate a relation-level predicate lock. This can result in an increased rate of serialization failures. It may be helpful to encourage the use of index scans by reducing random_page_cost and/or increasing cpu_tuple_cost. Be sure to weigh any decrease in transaction rollbacks and restarts against any overall change in query execution time.
Hot standby supports at most the repeatable read isolation level.
PostgreSQL 9.1 serializable isolation conflict occur condition and compared with Oracle

257 Concurrency Control Table-level lock conflict modes
Command: LOCK
Description: lock a table
Syntax:
LOCK [ TABLE ] [ ONLY ] name [, ...] [ IN lockmode MODE ] [ NOWAIT ]
where lockmode is one of:
ACCESS SHARE | ROW SHARE | ROW EXCLUSIVE | SHARE UPDATE EXCLUSIVE | SHARE | SHARE ROW EXCLUSIVE | EXCLUSIVE | ACCESS EXCLUSIVE

258 Concurrency Control Row locks
select for update -- such locks conflict with each other.
select for share -- such locks do not conflict with each other, but they forbid updating or deleting the locked rows; $PGDATA/pg_multixact stores the shared row-lock state.
Row locks modify the selected rows to mark them locked, and so will result in disk writes.
Page locks: In addition to table and row locks, page-level share/exclusive locks are used to control read/write access to table pages in the shared buffer pool. These locks are released immediately after a row is fetched or updated.
Deadlock: transactions end up waiting on each other; it can involve two or more transactions, and is something application design should avoid.

259 Concurrency Control Lock and Index
Though PostgreSQL provides nonblocking read/write access to table data, nonblocking read/write access is not currently offered for every index access method implemented in PostgreSQL. The various index types are handled as follows:
B-tree and GiST indexes: Short-term share/exclusive page-level locks are used for read/write access. Locks are released immediately after each index row is fetched or inserted. These index types provide the highest concurrency without deadlock conditions.
Hash indexes: Share/exclusive hash-bucket-level locks are used for read/write access. Locks are released after the whole bucket is processed. Bucket-level locks provide better concurrency than index-level ones, but deadlock is possible since the locks are held longer than one index operation.
GIN indexes: Short-term share/exclusive page-level locks are used for read/write access. Locks are released immediately after each index row is fetched or inserted. But note that insertion of a GIN-indexed value usually produces several index key insertions per row, so GIN might do substantial work for a single value's insertion.
Currently, B-tree indexes offer the best performance for concurrent applications; since they also have more features than hash indexes, they are the recommended index type for concurrent applications that need to index scalar data. When dealing with non-scalar data, B-trees are not useful, and GiST or GIN indexes should be used instead.

260 Performance Tips SQL tuning -- execution plans and the parameters that influence them
#seq_page_cost = 1.0
#random_page_cost = 4.0
#cpu_tuple_cost = 0.01
#cpu_index_tuple_cost = 0.005
#cpu_operator_cost = 0.0025
#effective_cache_size = 128MB
#default_statistics_target = 100
#constraint_exclusion = partition
#cursor_tuple_fraction = 0.1
#from_collapse_limit = 8
#join_collapse_limit = 8
#enable_bitmapscan = on
#enable_hashagg = on
#enable_hashjoin = on
#enable_indexscan = on
#enable_material = on
#enable_mergejoin = on
#enable_nestloop = on
#enable_seqscan = on
#enable_sort = on
#enable_tidscan = on
Genetic Query Optimizer

261 Performance Tips Keeping statistics up to date plays a large part in whether the planner produces a good or a bad plan.
PostgreSQL's statistics target and histogram_bounds
PostgreSQL Statistics and Query Explain Introduction
PostgreSQL row estimation algorithms
Controlling the Planner with Explicit JOIN Clauses
Example: CBO

262 Performance Tips CBO

263 Performance Tips CBO Principle autoanalyze
Per-column statistics collected for the planner include:
whether the statistics are inherited; the fraction of NULLs; the average entry width;
the number of distinct values (-1 means all values are unique); the most common values;
the frequencies of the most common values; histogram bucket boundaries;
the correlation between physical storage order and the logical order of this column.
These can be queried as shown below.
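A minimal look at these statistics through the pg_stats view; the tbl_user table is just the example from the earlier slides:

digoal=> select inherited, null_frac, avg_width, n_distinct, most_common_vals,
digoal->        most_common_freqs, histogram_bounds, correlation
digoal-> from pg_stats
digoal-> where tablename = 'tbl_user' and attname = 'id';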

264 Performance Tips CBO Principle autoanalyze

265 Performance Tips

266 Performance Tips Example: changing the planner's choice.
digoal=> create table tbl_cbo_test (id int primary key,firstname text,lastname text);
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "tbl_cbo_test_pkey" for table "tbl_cbo_test"
CREATE TABLE
digoal=> insert into tbl_cbo_test select generate_series(1, ),'zhou','digoal';
INSERT
digoal=> explain analyze select * from tbl_cbo_test where id=100000;
QUERY PLAN
Index Scan using tbl_cbo_test_pkey on tbl_cbo_test (cost= rows=1 width=16) (actual time= rows=1 loops=1)
  Index Cond: (id = 100000)
Total runtime: ms
(3 rows)

267 Performance Tips
digoal=> set enable_indexscan=off;
SET
digoal=> explain analyze select * from tbl_cbo_test where id=100000;
QUERY PLAN
Bitmap Heap Scan on tbl_cbo_test (cost= rows=1 width=16) (actual time= rows=1 loops=1)
  Recheck Cond: (id = 100000)
  -> Bitmap Index Scan on tbl_cbo_test_pkey (cost= rows=1 width=0) (actual time= rows=1 loops=1)
       Index Cond: (id = 100000)
Total runtime: ms
(5 rows)

268 Performance Tips
digoal=> set enable_bitmapscan=off;
SET
digoal=> explain analyze select * from tbl_cbo_test where id=100000;
QUERY PLAN
Seq Scan on tbl_cbo_test (cost= rows=1 width=16) (actual time= rows=1 loops=1)
  Filter: (id = 100000)
Total runtime: ms
(3 rows)

269 Performance Tips Example: JOIN Tuning

270 Performance Tips Example: JOIN Tuning

271 Performance Tips Example: JOIN Tuning -- which SQL is better?

272 Performance Tips Example: JOIN Tuning

273 Performance Tips Example: JOIN Tuning

274 Performance Tips Example: JOIN Tuning

275 Performance Tips Example: JOIN Tuning -- Explicit JOIN
What happens when you SET join_collapse_limit = 1 and use explicit JOIN syntax? See the sketch after this list.
join_collapse_limit -- With no limit, the more tables a query joins, the more join orderings there are to consider, making plan generation (exhaustive search) very expensive. The planner collects the tables involved in explicit JOINs (except FULL JOIN) into one list and permutes the join order within that list to find the best plan; join_collapse_limit caps how many tables may be folded into that list for reordering.
from_collapse_limit -- Similar to join_collapse_limit, but for subqueries; e.g. SELECT * FROM x, y, (SELECT * FROM a, b, c WHERE something) AS ss WHERE somethingelse; folds x, y, a, b, c into one list for join-order permutation, provided the list stays under from_collapse_limit.
geqo_threshold -- Not really recommended: the genetic optimizer produces unpredictable, randomly generated plans, although it can cut planning time for complex queries.
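A small sketch of forcing the join order; the tables a, b, c are illustrative. With join_collapse_limit = 1 the planner keeps explicit JOINs in the order written instead of permuting them:

digoal=> SET join_collapse_limit = 1;
digoal=> SELECT *
digoal-> FROM a
digoal->      JOIN b ON a.id = b.a_id
digoal->      JOIN c ON b.id = c.b_id;   -- planned strictly as (a JOIN b) JOIN c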

276 Performance Tips Example: JOIN Tuning

277 Performance Tips Data migration performance
Turn off autocommit; use COPY; drop indexes; drop foreign key constraints.
Increase maintenance_work_mem to speed up index builds.
Increase checkpoint_segments and checkpoint_timeout.
Disable WAL archival and streaming replication: set archive_mode to off, wal_level to minimal, and max_wal_senders to zero before loading the dump. Re-enable them after the import; standbys then need to be rebuilt from a new base backup.
Turn off autovacuum, and run ANALYZE after the import.
COPY commands will run fastest if you use a single transaction and have WAL archiving turned off.
Use pg_restore's parallel option. A load sketch follows.
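A minimal bulk-load sketch following the advice above; the table name and file path are illustrative:

digoal=> BEGIN;                                   -- one big transaction instead of autocommit
digoal=> COPY tbl_cbo_test FROM '/tmp/data.csv' WITH (FORMAT csv);
digoal=> COMMIT;
digoal=> ANALYZE tbl_cbo_test;                    -- refresh statistics after the load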

278 Day 2 PostgreSQL Day DBA QuickGuide

279 Day 2
PostgreSQL Client Applications
PostgreSQL Server Applications
Database Physical Storage
Server Administration: Database Layout; Reliability; Server Configuration; Routine Database Maintenance Tasks; Backup and Restore; HA and Replication (Stream Replication, Cascade Stream Replication, PostgreSQL-XC, PL/Proxy, pgpool)
Monitoring Database Activity
Procedure Language and Debug
PostgreSQL Distinguishing Feature
Additional Supplied Modules
Database Performance Tuning Short Case

280 PostgreSQL Client Applications
clusterdb -- clusterdb is a utility for reclustering tables in a PostgreSQL database. It finds tables that have previously been clustered, and clusters them again on the same index that was last used. Tables that have never been clustered are not affected.
clusterdb [connection-option...] [--verbose | -v] [--table | -t table ] [dbname]
clusterdb [connection-option...] [--verbose | -v] [--all | -a]
createdb -- create a new PostgreSQL database
createdb [connection-option...] [option...] [dbname] [description]
createlang -- install a PostgreSQL procedural language
createlang [connection-option...] langname [dbname]
createlang [connection-option...] --list | -l dbname
createuser -- define a new PostgreSQL user account
createuser [connection-option...] [option...] [username]

281 PostgreSQL Client Applications
dropdb -- remove a PostgreSQL database
dropdb [connection-option...] [option...] dbname
droplang -- remove a PostgreSQL procedural language
droplang [connection-option...] langname [dbname]
droplang [connection-option...] --list | -l dbname
dropuser -- remove a PostgreSQL user account
dropuser [connection-option...] [option...] [username]
ecpg -- embedded SQL C preprocessor
ecpg [option...] file...
pg_basebackup -- take a base backup of a PostgreSQL cluster
pg_basebackup [option...]

282 PostgreSQL Client Applications
pg_config -- retrieve information about the installed version of PostgreSQL
pg_config [option...]
pg_dump -- extract a PostgreSQL database into a script file or other archive file
pg_dump [connection-option...] [option...] [dbname]
pg_dumpall -- extract a PostgreSQL database cluster into a script file
pg_dumpall [connection-option...] [option...]
pg_restore -- restore a PostgreSQL database from an archive file created by pg_dump
pg_restore [connection-option...] [option...] [filename]
psql -- PostgreSQL interactive terminal
psql [option...] [dbname [username]]

283 PostgreSQL Client Applications
reindexdb -- reindex a PostgreSQL database
reindexdb [connection-option...] [--table | -t table ] [--index | -i index ] [dbname]
reindexdb [connection-option...] [--all | -a]
reindexdb [connection-option...] [--system | -s] [dbname]
vacuumdb -- garbage-collect and analyze a PostgreSQL database
vacuumdb [connection-option...] [--full | -f] [--freeze | -F] [--verbose | -v] [--analyze | -z] [--analyze-only | -Z] [--table | -t table [( column [,...] )] ] [dbname]
vacuumdb [connection-option...] [--full | -f] [--freeze | -F] [--verbose | -v] [--analyze | -z] [--analyze-only | -Z] [--all | -a]

284 PostgreSQL Server Applications
initdb -- create a new PostgreSQL database cluster
initdb [option...] --pgdata | -D directory
pg_controldata -- display control information of a PostgreSQL database cluster
pg_controldata [option] [datadir]
pg_ctl -- initialize, start, stop, or control a PostgreSQL server
pg_ctl init[db] [-s] [-D datadir] [-o initdb-options]
pg_ctl start [-w] [-t seconds] [-s] [-D datadir] [-l filename] [-o options] [-p path] [-c]
pg_ctl stop [-W] [-t seconds] [-s] [-D datadir] [-m s[mart] | f[ast] | i[mmediate] ]
pg_ctl restart [-w] [-t seconds] [-s] [-D datadir] [-c] [-m s[mart] | f[ast] | i[mmediate] ] [-o options]
pg_ctl reload [-s] [-D datadir]
pg_ctl status [-D datadir]
pg_ctl promote [-s] [-D datadir]

285 PostgreSQL Server Applications
pg_ctl kill signal_name process_id
pg_ctl register [-N servicename] [-U username] [-P password] [-D datadir] [-S a[uto] | d[emand] ] [-w] [-t seconds] [-s] [-o options]
pg_ctl unregister [-N servicename]
pg_resetxlog -- reset the write-ahead log and other control information of a PostgreSQL database cluster
pg_resetxlog [-f] [-n] [-o oid ] [-x xid ] [-e xid_epoch ] [-m mxid ] [-O mxoff ] [-l timelineid,fileid,seg ] datadir
postgres -- PostgreSQL database server. postgres is the PostgreSQL database server. In order for a client application to access a database it connects (over a network or locally) to a running postgres instance. The postgres instance then starts a separate server process to handle the connection.
postgres [option...]

286 PostgreSQL Server Applications
postmaster -- PostgreSQL database server. postmaster is a deprecated alias of postgres.
postmaster [option...]
Example: the control file, src/bin/pg_controldata/pg_controldata.c

287 PostgreSQL Server Applications
Example of control file contents
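The output shown on this slide can be reproduced with the pg_controldata server application; the data directory path is illustrative:

$ pg_controldata $PGDATA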

288 PostgreSQL Server Applications
You must enter single-user mode to repair the database when the system hits this error:
ERROR: database is not accepting commands to avoid wraparound data loss in database "mydb"
HINT: Stop the postmaster and use a standalone backend to VACUUM in "mydb".
Why does this error occur?
No row carrying a relfrozenxid-tracked XID may be older than 2^31 (about two billion) transactions. When the database contains rows older than 2^31 minus ten million transactions, it starts issuing warnings like:
WARNING: database "mydb" must be vacuumed within transactions
HINT: To avoid a database shutdown, execute a database-wide VACUUM in "mydb".
If the warnings are ignored, once rows older than 2^31 minus one million transactions exist, the database reports errors like:
ERROR: database is not accepting commands to avoid wraparound data loss in database "mydb"
HINT: Stop the postmaster and use a standalone backend to VACUUM in "mydb".
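A minimal sketch of the standalone-backend repair the HINT asks for; the database name and $PGDATA path are illustrative:

$ pg_ctl stop -m fast -D $PGDATA
$ postgres --single -D $PGDATA mydb
backend> VACUUM;
backend>                       -- exit single-user mode with Ctrl-D
$ pg_ctl start -D $PGDATA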

289 PostgreSQL Server Applications
PostgreSQL single-user mode usage, e.g. to repair a database
Starting postgres in bootstrapping mode: initdb itself uses bootstrapping mode, whose syntax differs greatly from normal mode; it speaks the BKI interface. For example, initdb feeds it $PGHOME/share/postgres.bki.
Structure of the postgres.bki file:
create bootstrap one of the critical tables
insert data describing at least the critical tables
close
Repeat for the other critical tables.
create (without bootstrap) a noncritical table
open
insert desired data
close
Repeat for the other noncritical tables.
Define indexes and toast tables.
build indices

290 PostgreSQL Server Applications
BKI commands
My data "disappeared"?
Use pg_resetxlog simulate tuple disappear within PostgreSQL

291 Database Physical Storage
$PGDATA

292 Database Physical Storage
This corresponds to the pg_default tablespace. If a newly created database does not specify a default tablespace, its default is determined by the default_tablespace parameter; when that is unset, pg_default is used, so objects created in that database without an explicit tablespace end up in the database's directory under base.

293 Database Physical Storage
Check the default tablespace of the current database, and list the tablespaces in the cluster, as sketched below.
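A minimal way to do both from psql:

postgres=# select datname, dattablespace from pg_database where datname = current_database();
postgres=# select oid, spcname, spclocation from pg_tablespace;   -- \db+ shows the same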

294 Database Physical Storage
Look up a database's oid, as shown below. The oid is the directory name: databases using the default tablespace pg_default have their directory under $PGDATA/base, while databases whose default tablespace is not pg_default have their directory inside the tablespace chosen at creation time.
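For example:

postgres=# select oid, datname from pg_database;
$ ls $PGDATA/base          -- one subdirectory per database oid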

295 Database Physical Storage
The figure shows the directory names under base for the production, template1, test, devel, and marketing databases.

296 Database Physical Storage
global directory -- corresponds to the pg_global tablespace; it holds cluster-wide catalog data such as pg_database and pg_roles.
pg_clog directory -- holds transaction commit status data.
pg_notify directory -- holds NOTIFY/LISTEN status data.
pg_multixact directory -- holds transaction status data for SELECT FOR SHARE, used for shared row locks.

297 Database Physical Storage
pg_serial directory -- introduced with the serializable isolation level in PostgreSQL 9.1; it stores state for committed serializable transactions.
pg_stat_tmp directory -- temporary files produced while collecting statistics live here.
pg_subtrans directory -- holds subtransaction status data.
pg_tblspc directory -- holds symbolic links to newly created tablespaces.
pg_twophase directory -- holds state for two-phase transactions.
pg_xlog directory or symlink (if a pg_xlog location was given at initdb time) -- holds the WAL log files.
PG_VERSION file -- the PostgreSQL major version, e.g. 9.1.

298 Database Physical Storage
pg_hba.conf file -- client authentication configuration.
pg_ident.conf file -- used together with pg_hba.conf; it maps operating-system user names to the database user names used when connecting. The mapname is referenced by the ident authentication method in pg_hba.conf.
# MAPNAME SYSTEM-USERNAME PG-USERNAME
postgresql.conf file -- the database configuration file.
postmaster.opts file -- written at the most recent database startup; records the command-line options postgres was started with.
postmaster.pid file -- records the current postmaster PID, the cluster data directory, the postmaster start time, the listen port, the Unix-socket directory, the listen address, and shared memory segment information.

299 Database Physical Storage
[Diagram: the authentication flow -- pg_hba.conf entries (TYPE DATABASE USER CIDR-ADDRESS METHOD; auth methods: trust, password, ident, LDAP, ...), pg_shadow, roles, connection limits, and which addresses PostgreSQL listens on.]

300 Database Physical Storage
Data object files
The main data file: see pg_class.relfilenode or the pg_relation_filenode() function, as sketched below. Once a file exceeds 1GB (or the --with-segsize chosen when compiling PostgreSQL), further segments are added under the same file name with suffixes .1, .2, ...
Every table and index has a free space map recording per-page free space; it can be inspected with the pageinspect module.
Every table and index has a visibility map recording which pages contain no dead tuples; it can be inspected with the pageinspect module.
Unlogged tables and indexes also have an initialization fork with the _init suffix; it too can be inspected with the pageinspect module.
TOAST tables: pg_class.reltoastrelid. TOAST = The Oversized-Attribute Storage Technique.
How updates differ when the modified column is stored in TOAST versus the base table
TOAST table with pgfincore
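Locating a table's files, reusing the tbl_user table from the earlier slides as an example:

digoal=> select pg_relation_filenode('tbl_user');   -- relfilenode, the base file name
digoal=> select pg_relation_filepath('tbl_user');   -- path relative to $PGDATA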

301 Database Physical Storage
Data object files -- the figure shows the file names of the customer, order, product, employee, and part tables stored in the production database.

302 Database Physical Storage
Data object files -- the figure shows the block layout of the customer table's first data file.

303 Database Physical Storage
Data object files -- the figure shows the internal structure of a block in the customer table's first data file.

304 Database Physical Storage
Data object files -- tuple layout.

305 Server Administration
Database Layout
Reliability
Server Configuration
Routine Database Maintenance Tasks
Backup and Restore
HA and Replication: Stream Replication; Cascade Stream Replication; PostgreSQL-XC, PL/Proxy, pgpool
Monitoring Database Activity

306 Logical Layout
[Diagram: the Instance (top level, one per cluster) contains Database(s); each database contains Schema(s); each schema contains objects -- Table(s), Index(es), View(s), Function(s), Sequence(s), Other(s); tables contain Row(s), which contain Column(s)/Field(s).]

307 Physical Layout
[Diagram: tables, TOAST tables, indexes, and sequences each map to Datafile(s); alongside them sit the Controlfile, the WALs, and the Archived WAL files.]

308 Process Layout
[Diagram: the APP handshakes and authenticates with the postmaster, which forks backend processes; the autovacuum launcher forks autovacuum workers; all share the shared memory area (shared buffer, WAL buffer) over IPC; the bgwriter writes datafiles, the WAL writer writes XLOGs, and the archiver copies them to archive files.]

309 Reliability Checklist for making the database reliable
After a transaction commits, is it guaranteed to be recoverable in the future?
Before the transaction returns success, its transaction log (xlog) is written to disk: synchronous_commit = on
Are backups restorable, and consistent after restore? -- fsync = on, full_page_writes = on
Has data that must reach non-volatile storage actually reached it?
write-through vs. write-back: disable the disks' write cache; only allow write caches with battery protection.
Can the database start up to a consistent state after the host crashes abnormally?
PostgreSQL periodically writes full page images to permanent WAL storage before modifying the actual page on disk. -- full_page_writes = on
Can the database start up to a consistent state after the database itself crashes?
Another risk of data loss is posed by the disk platter write operations themselves. Disk platters are divided into sectors, commonly 512 bytes each. Every physical read or write operation processes a whole sector. When a write request arrives at the drive, it might be for some multiple of 512 bytes (PostgreSQL typically writes 8192 bytes, or 16 sectors, at a time), and the process of writing could fail due to power loss at any time, meaning some of the 512-byte sectors were written while others were not. To guard against such failures, PostgreSQL periodically writes full page images to permanent WAL storage before modifying the actual page on disk. By doing this, during crash recovery PostgreSQL can restore partially-written pages from WAL. If you have file-system software that prevents partial page writes (e.g., ZFS), you can turn off this page imaging by turning off the full_page_writes parameter. Battery-Backed Unit (BBU) disk controllers do not prevent partial page writes unless they guarantee that data is written to the BBU as full (8kB) pages.

310 Reliability Checklist for making the database reliable
Can the transaction log be used to recover to an arbitrary point in time? Enable archiving, and keep a sound backup policy.
wal_level = archive or hot_standby
What if the storage dies? archive_mode = on, archive_command = 'cp %p /backup/%f'
What if the whole IDC goes down? Off-site disaster recovery, e.g. streaming replication.

311 Reliability
[Diagram: the synchronous write path. PostgreSQL calls the OS sync-write functions (wal_sync_method = ?, e.g. fsync) through the KERNEL: PostgreSQL -> OS buffer cache -> storage controller write cache (usually battery-protected; enable it to improve performance) -> disk write cache (usually not battery-protected, so it must be disabled) -> disk (non-volatile storage).]

312 Reliability
[Diagram: PITR = base backup + archived WAL; a timeline with checkpoint, mistake, and data change; "compare pd_lsn" and "dump to disk"; after a checkpoint, the first modification of each page writes the full page to WAL, which is archived together with the online backup file.]
1. Write the WAL (a.k.a. XLOG) first.
2. To guarantee the database can recover to a consistent state, a dirty page in the shared buffer must not be flushed to its data file until the changes to that page have been written through to the XLOG.
3. How is "XLOG before data page" enforced? Each page header carries a pd_lsn field, recording the position just past the last byte of the XLOG record that touched this page.
4. Before flushing a dirty page to the data file, it suffices to verify that XLOG up to and including pd_lsn has been written through to disk.
lsn: log sequence number --- in practice, a WAL file location

313 Server Configuration Viewing the current settings
SHOW ALL;
SHOW enable_seqscan;
select * from pg_settings;
Restoring a parameter to its default (the highest-priority default wins):
RESET configuration_parameter;
RESET ALL;
SET configuration_parameter TO DEFAULT;
Priority, from highest to lowest:
Session level: SET enable_seqscan TO off;
Role level: ALTER ROLE rolename SET enable_seqscan TO off;
Database level: ALTER DATABASE dbname SET enable_seqscan TO off;
Command line: postgres -c log_connections=yes -c log_destination='syslog'
Environment variable: env PGOPTIONS='-c geqo=off'
Default configuration: $PGDATA/postgresql.conf

314 Server Configuration Parameter value types
Boolean, integer, floating point, string or enum.
Boolean values can be written as on, off, true, false, yes, no, 1, 0 (all case-insensitive) or any unambiguous prefix of these.
Parameter value units
Default units can be found by referencing pg_settings.unit. For convenience, a different unit can also be specified explicitly. Valid memory units are kB (kilobytes), MB (megabytes), and GB (gigabytes); note that the multiplier for memory units is 1024, not 1000. Valid time units are ms (milliseconds), s (seconds), min (minutes), h (hours), and d (days).
The allowed values of an enum parameter can be found from pg_settings.enumvals. Enum parameter values are case-insensitive.

315 Server Configuration
category | name | setting | min_val | max_val | enumvals | unit
Autovacuum | autovacuum_freeze_max_age | \N
Autovacuum | autovacuum_max_workers | \N
Autovacuum | autovacuum_vacuum_cost_delay | \N | ms
Autovacuum | autovacuum_analyze_scale_factor | \N | \N
Autovacuum | autovacuum_analyze_threshold | \N
Autovacuum | autovacuum_naptime | \N | s
Autovacuum | autovacuum_vacuum_cost_limit | \N
Autovacuum | autovacuum_vacuum_threshold | \N
Autovacuum | autovacuum_vacuum_scale_factor | \N | \N
Autovacuum | autovacuum | on | \N | \N | \N | \N

316 Server Configuration
category | name | setting | min_val | max_val | enumvals | unit
Client Connection Defaults / Locale and Formatting | server_encoding | UTF8 | \N | \N | \N | \N
Client Connection Defaults / Locale and Formatting | lc_collate | C | \N | \N | \N | \N
Client Connection Defaults / Locale and Formatting | lc_ctype | C | \N | \N | \N | \N
Client Connection Defaults / Locale and Formatting | lc_messages | C | \N | \N | \N | \N
Client Connection Defaults / Locale and Formatting | timezone_abbreviations | Default | \N | \N | \N | \N
Client Connection Defaults / Locale and Formatting | extra_float_digits | \N
Client Connection Defaults / Locale and Formatting | TimeZone | PRC | \N | \N | \N | \N
Client Connection Defaults / Locale and Formatting | client_encoding | UTF8 | \N | \N | \N | \N
Client Connection Defaults / Locale and Formatting | DateStyle | ISO, MDY | \N | \N | \N | \N
Client Connection Defaults / Locale and Formatting | lc_time | C | \N | \N | \N | \N
Client Connection Defaults / Locale and Formatting | default_text_search_config | pg_catalog.english | \N | \N | \N | \N
Client Connection Defaults / Locale and Formatting | lc_numeric | C | \N | \N | \N | \N
Client Connection Defaults / Locale and Formatting | lc_monetary | C | \N | \N | \N | \N
Client Connection Defaults / Locale and Formatting | IntervalStyle | postgres | \N | \N | {postgres,postgres_verbose,sql_standard,iso_8601} | \N

317 Server Configuration
category | name | setting | min_val | max_val | enumvals | unit
Client Connection Defaults / Other Defaults | local_preload_libraries | \N | \N | \N | \N
Client Connection Defaults / Other Defaults | dynamic_library_path | $libdir | \N | \N | \N | \N
Client Connection Defaults / Other Defaults | tcp_keepalives_idle | \N | s
Client Connection Defaults / Other Defaults | gin_fuzzy_search_limit | \N
Client Connection Defaults / Other Defaults | tcp_keepalives_interval | \N | s
Client Connection Defaults / Other Defaults | tcp_keepalives_count | \N
Client Connection Defaults / Statement Behavior | session_replication_role | origin | \N | \N | {origin,replica,local} | \N
Client Connection Defaults / Statement Behavior | statement_timeout | \N | ms
Client Connection Defaults / Statement Behavior | check_function_bodies | on | \N | \N | \N | \N
Client Connection Defaults / Statement Behavior | vacuum_freeze_table_age | \N
Client Connection Defaults / Statement Behavior | xmlbinary | base64 | \N | \N | {base64,hex} | \N
Client Connection Defaults / Statement Behavior | temp_tablespaces | \N | \N | \N | \N
Client Connection Defaults / Statement Behavior | xmloption | content | \N | \N | {content,document} | \N

318 Server Configuration
category | name | setting | min_val | max_val | enumvals | unit
Client Connection Defaults / Statement Behavior | bytea_output | hex | \N | \N | {escape,hex} | \N
Client Connection Defaults / Statement Behavior | vacuum_freeze_min_age | \N
Client Connection Defaults / Statement Behavior | search_path | "$user",public | \N | \N | \N | \N
Client Connection Defaults / Statement Behavior | default_tablespace | \N | \N | \N | \N
Client Connection Defaults / Statement Behavior | default_transaction_deferrable | off | \N | \N | \N | \N
Client Connection Defaults / Statement Behavior | default_transaction_isolation | read committed | \N | \N | {serializable,"repeatable read","read committed","read uncommitted"} | \N
Client Connection Defaults / Statement Behavior | default_transaction_read_only | off | \N | \N | \N | \N
Client Connection Defaults / Statement Behavior | transaction_read_only | off | \N | \N | \N | \N
Client Connection Defaults / Statement Behavior | transaction_isolation | read committed | \N | \N | \N | \N
Client Connection Defaults / Statement Behavior | transaction_deferrable | off | \N | \N | \N | \N
Connections and Authentication / Connection Settings | max_connections | \N

319 Server Configuration
category | name | setting | min_val | max_val | enumvals | unit
Connections and Authentication / Connection Settings | listen_addresses | localhost | \N | \N | \N | \N
Connections and Authentication / Connection Settings | unix_socket_group | \N | \N | \N | \N
Connections and Authentication / Connection Settings | unix_socket_directory | \N | \N | \N | \N
Connections and Authentication / Connection Settings | bonjour_name | \N | \N | \N | \N
Connections and Authentication / Connection Settings | bonjour | off | \N | \N | \N | \N
Connections and Authentication / Connection Settings | superuser_reserved_connections | \N
Connections and Authentication / Connection Settings | unix_socket_permissions | \N
Connections and Authentication / Connection Settings | port | \N
Connections and Authentication / Security and Authentication | ssl | off | \N | \N | \N | \N
Connections and Authentication / Security and Authentication | ssl_ciphers | \N | \N | \N | \N

320 Server Configuration
category | name | setting | min_val | max_val | enumvals | unit
Connections and Authentication / Security and Authentication | db_user_namespace | off | \N | \N | \N | \N
Connections and Authentication / Security and Authentication | authentication_timeout | \N | s
Connections and Authentication / Security and Authentication | krb_server_keyfile | \N | \N | \N | \N
Connections and Authentication / Security and Authentication | krb_caseins_users | off | \N | \N | \N | \N
Connections and Authentication / Security and Authentication | krb_srvname | postgres | \N | \N | \N | \N
Connections and Authentication / Security and Authentication | ssl_renegotiation_limit | \N | kB
Connections and Authentication / Security and Authentication | password_encryption | on | \N | \N | \N | \N

321 Server Configuration
category | name | setting | min_val | max_val | enumvals | unit
Customized Options | custom_variable_classes | \N | \N | \N | \N
Developer Options | ignore_system_indexes | off | \N | \N | \N | \N
Developer Options | post_auth_delay | \N | s
Developer Options | allow_system_table_mods | on | \N | \N | \N | \N
Developer Options | trace_recovery_messages | log | \N | \N | {debug5,debug4,debug3,debug2,debug1,log,notice,warning,error} | \N
Developer Options | pre_auth_delay | \N | s
Developer Options | zero_damaged_pages | off | \N | \N | \N | \N
Developer Options | debug_assertions | off | \N | \N | \N | \N
Developer Options | trace_sort | off | \N | \N | \N | \N
Developer Options | trace_notify | off | \N | \N | \N | \N
Error Handling | restart_after_crash | on | \N | \N | \N | \N
Error Handling | exit_on_error | off | \N | \N | \N | \N

322 Server Configuration
category | name | setting | min_val | max_val | enumvals | unit
File Locations | config_file | /pgdata/digoal/1921/data02/pg_root/postgresql.conf | \N | \N | \N | \N
File Locations | hba_file | /pgdata/digoal/1921/data02/pg_root/pg_hba.conf | \N | \N | \N | \N
File Locations | data_directory | /pgdata/digoal/1921/data02/pg_root | \N | \N | \N | \N
File Locations | ident_file | /pgdata/digoal/1921/data02/pg_root/pg_ident.conf | \N | \N | \N | \N
File Locations | external_pid_file | \N | \N | \N | \N
Lock Management | max_pred_locks_per_transaction | \N
Lock Management | max_locks_per_transaction | \N
Lock Management | deadlock_timeout | \N | ms
Preset Options | server_version | \N | \N | \N | \N
Preset Options | wal_block_size | \N
Preset Options | server_version_num | \N
Preset Options | block_size | \N
Preset Options | segment_size | \N | kB
Preset Options | integer_datetimes | on | \N | \N | \N | \N
Preset Options | max_index_keys | \N
Preset Options | wal_segment_size | \N | kB
Preset Options | max_identifier_length | \N
Preset Options | max_function_args | \N

323 Server Configuration
category | name | setting | min_val | max_val | enumvals | unit
Query Tuning / Genetic Query Optimizer | geqo_effort | \N
Query Tuning / Genetic Query Optimizer | geqo | on | \N | \N | \N | \N
Query Tuning / Genetic Query Optimizer | geqo_generations | \N
Query Tuning / Genetic Query Optimizer | geqo_pool_size | \N
Query Tuning / Genetic Query Optimizer | geqo_seed | \N | \N
Query Tuning / Genetic Query Optimizer | geqo_selection_bias | \N | \N
Query Tuning / Genetic Query Optimizer | geqo_threshold | \N
Query Tuning / Other Planner Options | constraint_exclusion | partition | \N | \N | {partition,on,off} | \N
Query Tuning / Other Planner Options | from_collapse_limit | \N
Query Tuning / Other Planner Options | cursor_tuple_fraction | \N | \N
Query Tuning / Other Planner Options | join_collapse_limit | \N
Query Tuning / Other Planner Options | default_statistics_target | \N

324 Server Configuration
category | name | setting | min_val | max_val | enumvals | unit
Query Tuning / Planner Cost Constants | cpu_operator_cost | \N | \N
Query Tuning / Planner Cost Constants | effective_cache_size | \N | kB
Query Tuning / Planner Cost Constants | cpu_index_tuple_cost | \N | \N
Query Tuning / Planner Cost Constants | cpu_tuple_cost | \N | \N
Query Tuning / Planner Cost Constants | seq_page_cost | \N | \N
Query Tuning / Planner Cost Constants | random_page_cost | \N | \N
Query Tuning / Planner Method Configuration | enable_hashjoin | on | \N | \N | \N | \N
Query Tuning / Planner Method Configuration | enable_indexscan | on | \N | \N | \N | \N
Query Tuning / Planner Method Configuration | enable_material | on | \N | \N | \N | \N
Query Tuning / Planner Method Configuration | enable_mergejoin | on | \N | \N | \N | \N
Query Tuning / Planner Method Configuration | enable_tidscan | on | \N | \N | \N | \N
Query Tuning / Planner Method Configuration | enable_sort | on | \N | \N | \N | \N
Query Tuning / Planner Method Configuration | enable_nestloop | on | \N | \N | \N | \N
Query Tuning / Planner Method Configuration | enable_seqscan | off | \N | \N | \N | \N
Query Tuning / Planner Method Configuration | enable_bitmapscan | on | \N | \N | \N | \N
Query Tuning / Planner Method Configuration | enable_hashagg | on | \N | \N | \N | \N

325 Server Configuration
category | name | setting | min_val | max_val | enumvals | unit
Replication / Master Server | max_wal_senders | \N
Replication / Master Server | wal_keep_segments | \N
Replication / Master Server | synchronous_standby_names | \N | \N | \N | \N
Replication / Master Server | wal_sender_delay | \N | ms
Replication / Master Server | replication_timeout | \N | ms
Replication / Master Server | vacuum_defer_cleanup_age | \N
Replication / Standby Servers | hot_standby | off | \N | \N | \N | \N
Replication / Standby Servers | max_standby_streaming_delay | \N | ms
Replication / Standby Servers | hot_standby_feedback | off | \N | \N | \N | \N
Replication / Standby Servers | wal_receiver_status_interval | \N | s
Replication / Standby Servers | max_standby_archive_delay | \N | ms

326 Server Configuration
category | name | setting | min_val | max_val | enumvals | unit
Reporting and Logging / What to Log | log_disconnections | off | \N | \N | \N | \N
Reporting and Logging / What to Log | log_connections | off | \N | \N | \N | \N
Reporting and Logging / What to Log | log_line_prefix | \N | \N | \N | \N
Reporting and Logging / What to Log | log_autovacuum_min_duration | \N | ms
Reporting and Logging / What to Log | log_hostname | off | \N | \N | \N | \N
Reporting and Logging / What to Log | log_timezone | PRC | \N | \N | \N | \N
Reporting and Logging / What to Log | log_checkpoints | off | \N | \N | \N | \N
Reporting and Logging / What to Log | log_statement | none | \N | \N | {none,ddl,mod,all} | \N
Reporting and Logging / What to Log | log_duration | off | \N | \N | \N | \N
Reporting and Logging / What to Log | log_error_verbosity | default | \N | \N | {terse,default,verbose} | \N
Reporting and Logging / What to Log | log_lock_waits | off | \N | \N | \N | \N
Reporting and Logging / What to Log | log_temp_files | \N | kB
Reporting and Logging / What to Log | debug_pretty_print | on | \N | \N | \N | \N
Reporting and Logging / What to Log | debug_print_parse | off | \N | \N | \N | \N
Reporting and Logging / What to Log | debug_print_plan | off | \N | \N | \N | \N

327 Server Configuration
category | name | setting | min_val | max_val | enumvals | unit
Reporting and Logging / What to Log | application_name | psql | \N | \N | \N | \N
Reporting and Logging / What to Log | debug_print_rewritten | off | \N | \N | \N | \N
Reporting and Logging / When to Log | log_min_duration_statement | \N | ms
Reporting and Logging / When to Log | log_min_messages | warning | \N | \N | {debug5,debug4,debug3,debug2,debug1,info,notice,warning,error,log,fatal,panic} | \N
Reporting and Logging / When to Log | log_min_error_statement | error | \N | \N | {debug5,debug4,debug3,debug2,debug1,info,notice,warning,error,log,fatal,panic} | \N
Reporting and Logging / When to Log | client_min_messages | notice | \N | \N | {debug5,debug4,debug3,debug2,debug1,log,notice,warning,error} | \N
Reporting and Logging / Where to Log | silent_mode | off | \N | \N | \N | \N
Reporting and Logging / Where to Log | logging_collector | off | \N | \N | \N | \N
Reporting and Logging / Where to Log | log_rotation_size | \N | kB
Reporting and Logging / Where to Log | log_truncate_on_rotation | off | \N | \N | \N | \N
Reporting and Logging / Where to Log | log_destination | stderr | \N | \N | \N | \N
Reporting and Logging / Where to Log | log_filename | postgresql-%Y-%m-%d_%H%M%S.log | \N | \N | \N | \N

328 Server Configuration
category | name | setting | min_val | max_val | enumvals | unit
Reporting and Logging / Where to Log | log_rotation_age | \N | min
Reporting and Logging / Where to Log | log_directory | pg_log | \N | \N | \N | \N
Reporting and Logging / Where to Log | syslog_ident | postgres | \N | \N | \N | \N
Reporting and Logging / Where to Log | syslog_facility | local0 | \N | \N | {local0,local1,local2,local3,local4,local5,local6,local7} | \N
Reporting and Logging / Where to Log | log_file_mode | \N

329 Server Configuration
category | name | setting | min_val | max_val | enumvals | unit
Resource Usage / Asynchronous Behavior | effective_io_concurrency | \N
Resource Usage / Background Writer | bgwriter_lru_multiplier | \N | \N
Resource Usage / Background Writer | bgwriter_lru_maxpages | \N
Resource Usage / Background Writer | bgwriter_delay | \N | ms
Resource Usage / Cost-Based Vacuum Delay | vacuum_cost_page_dirty | \N
Resource Usage / Cost-Based Vacuum Delay | vacuum_cost_page_hit | \N
Resource Usage / Cost-Based Vacuum Delay | vacuum_cost_page_miss | \N
Resource Usage / Cost-Based Vacuum Delay | vacuum_cost_delay | \N | ms
Resource Usage / Cost-Based Vacuum Delay | vacuum_cost_limit | \N
Resource Usage / Kernel Resources | shared_preload_libraries | \N | \N | \N | \N
Resource Usage / Kernel Resources | max_files_per_process | \N
Resource Usage / Memory | track_activity_query_size | \N
Resource Usage / Memory | shared_buffers | \N | kB
Resource Usage / Memory | max_prepared_transactions | \N
Resource Usage / Memory | max_stack_depth | \N | kB
Resource Usage / Memory | temp_buffers | \N | kB
Resource Usage / Memory | work_mem | \N | kB
Resource Usage / Memory | maintenance_work_mem | \N | kB

330 Server Configuration
category | name | setting | min_val | max_val | enumvals | unit
Statistics / Monitoring | log_parser_stats | off | \N | \N | \N | \N
Statistics / Monitoring | log_executor_stats | off | \N | \N | \N | \N
Statistics / Monitoring | log_planner_stats | off | \N | \N | \N | \N
Statistics / Monitoring | log_statement_stats | off | \N | \N | \N | \N
Statistics / Query and Index Statistics Collector | stats_temp_directory | pg_stat_tmp | \N | \N | \N | \N
Statistics / Query and Index Statistics Collector | track_functions | none | \N | \N | {none,pl,all} | \N
Statistics / Query and Index Statistics Collector | track_activities | on | \N | \N | \N | \N
Statistics / Query and Index Statistics Collector | track_counts | on | \N | \N | \N | \N
Statistics / Query and Index Statistics Collector | update_process_title | on | \N | \N | \N | \N

331 Server Configuration
category | name | setting | min_val | max_val | enumvals | unit
Version and Platform Compatibility / Other Platforms and Clients | transform_null_equals | off | \N | \N | \N | \N
Version and Platform Compatibility / Previous PostgreSQL Versions | lo_compat_privileges | off | \N | \N | \N | \N
Version and Platform Compatibility / Previous PostgreSQL Versions | quote_all_identifiers | off | \N | \N | \N | \N
Version and Platform Compatibility / Previous PostgreSQL Versions | synchronize_seqscans | on | \N | \N | \N | \N
Version and Platform Compatibility / Previous PostgreSQL Versions | sql_inheritance | on | \N | \N | \N | \N
Version and Platform Compatibility / Previous PostgreSQL Versions | escape_string_warning | on | \N | \N | \N | \N
Version and Platform Compatibility / Previous PostgreSQL Versions | backslash_quote | safe_encoding | \N | \N | {safe_encoding,on,off} | \N
Version and Platform Compatibility / Previous PostgreSQL Versions | array_nulls | on | \N | \N | \N | \N
Version and Platform Compatibility / Previous PostgreSQL Versions | default_with_oids | off | \N | \N | \N | \N

332 Server Configuration
Version and Platform Compatibility / Previous PostgreSQL Versions | standard_conforming_strings | on | \N | \N | \N | \N
Write-Ahead Log / Archiving | archive_mode | off | \N | \N | \N | \N
Write-Ahead Log / Archiving | archive_command | (disabled) | \N | \N | \N | \N
Write-Ahead Log / Archiving | archive_timeout | \N | s
Write-Ahead Log / Checkpoints | checkpoint_timeout | \N | s
Write-Ahead Log / Checkpoints | checkpoint_warning | \N | s
Write-Ahead Log / Checkpoints | checkpoint_completion_target | \N | \N
Write-Ahead Log / Checkpoints | checkpoint_segments | \N
Write-Ahead Log / Settings | wal_level | minimal | \N | \N | {minimal,archive,hot_standby} | \N
Write-Ahead Log / Settings | wal_buffers | \N | kB
Write-Ahead Log / Settings | fsync | on | \N | \N | \N | \N
Write-Ahead Log / Settings | wal_sync_method | fdatasync | \N | \N | {fsync,fdatasync,open_sync} | \N
Write-Ahead Log / Settings | wal_writer_delay | \N | ms
Write-Ahead Log / Settings | full_page_writes | on | \N | \N | \N | \N
Write-Ahead Log / Settings | commit_delay | \N
Write-Ahead Log / Settings | synchronous_commit | on | \N | \N | {local,on,off} | \N
Write-Ahead Log / Settings | commit_siblings | \N

333 Server Configuration Parameters usually adjusted right after initializing a database
listen_addresses = ' ' -- listen address
port = -- can be changed to another port
max_connections = -- maximum allowed connections. If concurrency actually reached such heights you might need a machine with hundreds of cores to keep up, otherwise performance drops. Some applications never release connections, so many connections may be held while few actually serve requests; size this for your real environment.
superuser_reserved_connections = -- connections reserved for superusers
unix_socket_directory = '/pgdata/digoal/1921/data02/pg_root' -- the default is /tmp, which is not very safe; placing it in $PGDATA is more sensible, since $PGDATA has 700 permissions.
unix_socket_permissions = 0700 -- change the unix socket file's permissions to 700
tcp_keepalives_idle = -- how long a TCP connection stays idle before keepalives are sent
tcp_keepalives_interval = -- how long before the next keepalive is sent
tcp_keepalives_count = -- how many keepalive packets are sent in total
shared_buffers = 512MB -- Day 1 covered how to size shared_buffers
work_mem = 1MB -- the default is 1MB. If data frequently spills to temporary files for sorts or GROUP BY, consider a larger value. It is allocated on demand: each sort, merge-join hash table, and DISTINCT consumes work_mem, and a plan with several such operations may need several multiples of work_mem at peak.

334 Server Configuration
maintenance_work_mem = 512MB -- used by index creation and vacuum, allocated on demand up to the configured value; concurrent index builds and autovacuum workers may together consume excessive memory.
max_stack_depth = 8MB -- generally set equal to or slightly below the ulimit stack size.
shared_preload_libraries = 'pg_stat_statements' -- libraries loaded at cluster start; here pg_stat_statements, a module that tracks execution counts, CPU cost, etc. of SQL statements.
vacuum_cost_delay = 10ms -- VACUUM is I/O-intensive; the delay makes VACUUM sleep 10 milliseconds whenever its accumulated cost exceeds vacuum_cost_limit.
bgwriter_delay = 10ms -- delay 10 milliseconds between background writer rounds.
wal_level = hot_standby -- the WAL level: must be archive or hot_standby to enable backups, and hot_standby to build a hot standby.
synchronous_commit = off -- turn off synchronous XLOG writes, greatly increasing write-transaction throughput. Consistency is not compromised, but if the database crashes and needs recovery, the restored database may lose the last 10 milliseconds (wal_writer_delay) of transactions.
wal_sync_method = fdatasync -- use pg_test_fsync to find which sync interface is most efficient on your system.
wal_buffers = 16384kB -- on a busy system generally set to the size of one xlog segment.

335 Server Configuration
wal_writer_delay = 10ms -- delay 10 milliseconds between WAL writer rounds.
commit_delay = -- at commit time, if the system has at least commit_siblings uncommitted transactions, wait this long so their commit I/O requests can be merged, reducing the number of I/O operations.
commit_siblings = 5
checkpoint_segments = -- how many xlog rotations trigger a checkpoint; generally sized so the segments exceed shared_buffers. E.g. with shared_buffers = 1024MB and 16MB WAL files, checkpoint_segments >= 1024/16.
archive_mode = on -- changing this requires a restart, so enable it right after installation.
archive_command = '/bin/date' -- this can be changed with a reload; a common trick is to set a no-op command first.
max_wal_senders = -- changing this requires a restart, so set some number up front.
wal_sender_delay = 10ms -- In each round the WAL sender sends any WAL accumulated since the last round to the standby server. It then sleeps for wal_sender_delay milliseconds, and repeats. The sleep is interrupted by transaction commit, so the effects of a committed transaction are sent to standby servers as soon as the commit happens, regardless of this setting.
wal_keep_segments = -- keep at least this many xlog segments on the primary, even if some are no longer needed for the primary's own recovery.

336 Server Configuration
#synchronous_standby_names = '' -- configure this if you plan to use synchronous streaming replication. Only one standby plays the synchronous-replication role at a time; if that node dies or lags for some reason, the second configured node takes over the synchronous-standby role.
hot_standby = on -- a standby-side setting: whether clients may connect to the standby for read-only queries.
max_standby_archive_delay = 300s -- replay of archived WAL on the standby must complete within this time; receiving is not affected. The clock starts from the last time replay caught up with receive.
max_standby_streaming_delay = 300s -- replay of streamed WAL on the standby must complete within this time; receiving is not affected. The clock starts from the last time replay caught up with receive.
wal_receiver_status_interval = 10s -- Specifies the minimum frequency for the WAL receiver process on the standby to send information about replication progress to the primary, where it can be seen using the pg_stat_replication view. The standby will report the last transaction log position it has written, the last position it has flushed to disk, and the last position it has applied. Updates are sent each time the write or flush positions change, or at least as often as specified by this parameter. Thus, the apply position may lag slightly behind the true position.
hot_standby_feedback = on -- Specifies whether or not a hot standby will send feedback to the primary about queries currently executing on the standby.

337 Server Configuration
random_page_cost = 2.0 -- lowering it makes the planner favor index scans over full table scans.
effective_cache_size = 12000MB -- raising it also makes the planner favor index scans over full table scans.
log_destination = 'csvlog' -- easy to load into a table for analysis
logging_collector = on
log_directory = '/var/applog/pg_log/<cluster name>/<port>'
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
log_file_mode = 0600
log_truncate_on_rotation = on -- eases log file maintenance
log_rotation_age = 1d -- start a new log file every day
log_rotation_size = 10MB -- start a new log file once the current one exceeds 10MB
log_min_duration_statement = 1000ms -- log SQL running longer than 1 second
log_checkpoints = on -- log checkpoint activity
log_lock_waits = on -- log lock wait times
log_statement = 'ddl' -- log DDL statements

338 Server Configuration
track_activity_query_size = 2048 -- raise the maximum recorded SQL length to 2048 so longer statements are captured
autovacuum = on -- enable autovacuum
log_autovacuum_min_duration = 0 -- log every autovacuum action
deadlock_timeout = 1s -- the minimum deadlock-detection interval is 1 second; if deadlock detection puts noticeable load on the system, raise it
custom_variable_classes = 'pg_stat_statements' -- custom parameter class for the pg_stat_statements module
pg_stat_statements.max = 1000
pg_stat_statements.track = all
Dangerous settings -- these make the database unrecoverable or inconsistent after a crash:
fsync = off
full_page_writes = off

339 Server Configuration
Module parameters
Developer parameters -- generally for debugging, recovery, and other special scenarios.
Command-line options

340 Routine Database Maintenance Tasks
Routine Vacuuming
Why vacuum? PostgreSQL's MVCC provides excellent read/write concurrency and very strong transaction isolation, but updated and deleted rows are not physically removed from their pages, so a mechanism is needed to reclaim this dead data; otherwise tables bloat. The usual approach is to let the system reclaim it automatically by enabling autovacuum.
Preventing Transaction ID Wraparound Failures.
How does autovacuum track which pages have dead data to reclaim?
PostgreSQL 8.3 and older:
max_fsm_pages -- six bytes of shared memory are consumed for each page slot.
max_fsm_relations -- roughly seventy bytes of shared memory are consumed for each slot.
These could overflow and lose track of pages.
PostgreSQL 8.4 and newer:
per-relation fsm and vm files. They cannot overflow, and the visibility map (tracking pages with no dead tuples) greatly reduces the number of blocks that must be scanned.

341 Routine Database Maintenance Tasks
Routine Vacuuming
fsm structure: PostgreSQL 8.4 Free Space Map Principle
When is autovacuum triggered?
autovacuum = on
autovacuum_vacuum_threshold = 50
autovacuum_analyze_threshold = 50
autovacuum_vacuum_scale_factor = 0.2
autovacuum_analyze_scale_factor = 0.1
vacuum threshold = vacuum base threshold + vacuum scale factor * number of tuples
analyze threshold = analyze base threshold + analyze scale factor * number of tuples
-- Preventing Transaction ID Wraparound Failures: autovacuum is invoked on any table that might contain XIDs older than the age specified by the configuration parameter autovacuum_freeze_max_age. (This will happen even if autovacuum is disabled.)
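A quick worked example with the defaults above, for a table of 10,000 tuples:

vacuum threshold  = 50 + 0.2 * 10000 = 2050 dead tuples before vacuum triggers
analyze threshold = 50 + 0.1 * 10000 = 1050 changed tuples before analyze triggers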

342 Routine Database Maintenance Tasks
Routine Vacuuming
When is autovacuum triggered?
If for some reason autovacuum fails to clear old XIDs from a table, the system will begin to emit warning messages like this when the database's oldest XIDs reach ten million transactions from the wraparound point:
WARNING: database "mydb" must be vacuumed within transactions
HINT: To avoid a database shutdown, execute a database-wide VACUUM in "mydb".
If these warnings are ignored, the system will shut down and refuse to start any new transactions once there are fewer than 1 million transactions left until wraparound:
ERROR: database is not accepting commands to avoid wraparound data loss in database "mydb"
HINT: Stop the postmaster and use a standalone backend to VACUUM in "mydb".

343 Routine Database Maintenance Tasks
Routine Vacuuming
Which blocks get scanned in the different trigger scenarios?
The relfrozenxid column of a table's pg_class row represents the oldest XID still present among the table's rows. It is only advanced by a vacuum that scans the whole table (not to be confused with VACUUM FULL).
When vacuum runs, how does it decide what to do? It first obtains the table's age, age(pg_class.relfrozenxid), and compares it with the parameters below:
vacuum uses vacuum_freeze_min_age to decide which row versions get their XID replaced with FrozenXID (an XID older than any normal XID).
vacuum uses vacuum_freeze_table_age to decide whether to scan every block of the table, after which pg_class.relfrozenxid can be advanced.
Two more cases force a full scan: VACUUM FREEZE, and when every page happens to contain dead row versions.
The whole table is scanned when relfrozenxid is more than vacuum_freeze_table_age transactions old, when VACUUM's FREEZE option is used, or when all pages happen to require vacuuming to remove dead row versions.

344 Routine Database Maintenance Tasks
Routine Vacuuming
Table-level autovacuum parameters, settable per table as sketched below.
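A minimal sketch of overriding autovacuum thresholds for one table; the table name and values are illustrative:

digoal=> ALTER TABLE tbl_user SET (autovacuum_vacuum_scale_factor = 0.05,
digoal->                           autovacuum_vacuum_threshold = 200);
digoal=> ALTER TABLE tbl_user RESET (autovacuum_vacuum_scale_factor);  -- back to the global default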

345 Routine Database Maintenance Tasks
Routine Reindexing
Why reindex B-tree indexes? After frequent non-HOT updates a B-tree index bloats. B-tree index pages that have become completely empty are reclaimed for re-use, but a freshly built B-tree and a heavily updated one differ in how well the logical page order matches the physical order, and therefore in efficiency.
Concurrently rebuild bloated indexes -- see the sketch below.
Why reindex non-B-tree indexes? When a non-B-tree index has bloated badly and performance has degraded severely, reindex it.
Monitoring: Use PostgreSQL collect and analyze Operation System statistics
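A minimal sketch of rebuilding a bloated index without blocking writes; the index and table names reuse the later replication-demo example and are illustrative. REINDEX itself takes an exclusive lock, so a concurrent build plus a swap is the usual workaround:

digoal=> CREATE INDEX CONCURRENTLY idx_test_crt_time_new ON test (crt_time);
digoal=> DROP INDEX idx_test_crt_time;
digoal=> ALTER INDEX idx_test_crt_time_new RENAME TO idx_test_crt_time;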

346 Backup and Restore Backing up data or SQL
Data/SQL backups support backup and restore across versions; to restore an older server's backup into a newer server, it is generally recommended to back up with the newer pg_dump and restore with the newer pg_restore.
pg_dump -- output supports two formats: plain text and the PostgreSQL binary (custom) format.
pg_dump [connection-option...] [option...] [dbname]
pg_dumpall -- output supports plain text only.
pg_dumpall [connection-option...] [option...]
COPY -- similar to a plain-text backup, but the columns and rows dumped can be customized.

347 Backup and Restore Backing up data or SQL: pg_dump

348 Backup and Restore Backing up data or SQL: pg_dump

349 Backup and Restore Backing up data or SQL: pg_dumpall

350 Backup and Restore Backing up data or SQL: pg_dumpall

351 Backup and Restore Backing up data or SQL: COPY
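A minimal COPY-based dump/reload sketch; the table, column list, and path are illustrative (server-side COPY requires superuser; psql's \copy is the client-side equivalent):

postgres=# COPY (SELECT id, firstname FROM tbl_user) TO '/tmp/tbl_user.csv' WITH (FORMAT csv, HEADER);
postgres=# COPY tbl_user (id, firstname) FROM '/tmp/tbl_user.csv' WITH (FORMAT csv, HEADER);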

352 Backup and Restore Backing up data files -- incremental backups usable for recovery to a point in time, to an xid, or to a custom restore point
Prerequisites for a valid data-file backup:
full_page_writes = on
fsync = on
wal_level = archive or hot_standby
archive_mode = on
archive_command = 'cp %p /backup/%f'
pg_start_backup -- exclusive: only one pg_start_backup may run at a time. Back up $PGDATA plus the tablespace directories the symlinks in pg_tblspc point to; the pg_xlog directory need not be backed up.
pg_stop_backup -- stop the backup. CHECKPOINT; pg_switch_xlog(); then back up the WAL archived during the backup window.
pg_basebackup -- commonly used to create a standby. A sketch follows.
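A minimal low-level base-backup sketch following the steps above; the label and paths are illustrative:

postgres=# select pg_start_backup('weekly_base');
$ tar -czf /backup/base.tar.gz -C $PGDATA . --exclude=pg_xlog
postgres=# select pg_stop_backup();        -- waits until the needed WAL is archived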

353 Backup and Restore
After pg_start_backup runs, a backup_label file appears in $PGDATA, with contents like:
START WAL LOCATION: 0/B (file C)
CHECKPOINT LOCATION: 0/B
BACKUP METHOD: pg_start_backup
START TIME: :07:32 CST
LABEL: test
After pg_stop_backup runs, a backup-history file marking the completed backup is written to pg_xlog; its name and contents are shown below.

354 Backup and Restore pg_basebackup

355 Backup and Restore xlog.c

356 Backup and Restore Restoring -- reading a plain-text dump
pg_dump -F p -f ./test.dmp -C -E UTF8 -b -h <host> -U postgres test
cat test.dmp

357 Backup and Restore Reading a plain-text dump

358 Backup and Restore Reading a plain-text dump

359 Backup and Restore Reading a plain-text dump

360 Backup and Restore Restoring a plain-text dump with psql -f
postgres=# drop database test;
psql -f ./test.dmp
Restoring a binary-format dump with pg_restore:
pg_dump -F c -f ./test.dmp.c -C -E UTF8 -b -h <host> -U postgres test
pg_restore -v -d postgres -C -F c -h <host> -U postgres ./test.dmp.c
pg_restore: connecting to database for restore
pg_restore: creating DATABASE test
pg_restore: connecting to new database "test"
pg_restore: connecting to database "test" as user "postgres"
pg_restore: creating SCHEMA public

361 Backup and Restore Restoring a binary-format dump with pg_restore
pg_restore: creating COMMENT SCHEMA public
pg_restore: creating EXTENSION plpgsql
pg_restore: creating COMMENT EXTENSION plpgsql
pg_restore: creating FUNCTION f_void()
pg_restore: creating SEQUENCE seq
pg_restore: executing SEQUENCE SET seq
pg_restore: creating TABLE userinfo
pg_restore: creating VIEW v_test
pg_restore: restoring data for table "userinfo"
pg_restore: creating CONSTRAINT userinfo_pkey
pg_restore: setting owner and privileges for DATABASE test
pg_restore: setting owner and privileges for SCHEMA public
pg_restore: setting owner and privileges for COMMENT SCHEMA public
pg_restore: setting owner and privileges for ACL public
pg_restore: setting owner and privileges for EXTENSION plpgsql

362 Backup and Restore Restoring a binary-format dump with pg_restore
pg_restore: setting owner and privileges for COMMENT EXTENSION plpgsql
pg_restore: setting owner and privileges for FUNCTION f_void()
pg_restore: setting owner and privileges for SEQUENCE seq
pg_restore: setting owner and privileges for SEQUENCE SET seq
pg_restore: setting owner and privileges for TABLE userinfo
pg_restore: setting owner and privileges for VIEW v_test
pg_restore: setting owner and privileges for TABLE DATA userinfo
pg_restore: setting owner and privileges for CONSTRAINT userinfo_pkey
Customizing the restore by editing the TOC file, as sketched below.
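A minimal sketch of a TOC-driven selective restore; file names are illustrative:

$ pg_restore -l ./test.dmp.c > toc.list      # write the archive's table of contents
$ vi toc.list                                # comment out unwanted entries with ';'
$ pg_restore -L toc.list -d test ./test.dmp.c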

363 Backup and Restore PITR -- understanding the recovery.conf file
A sample file lives at $PGHOME/share/recovery.conf.sample. A sketch follows.
restore_command -- the command used to fetch archived WAL files.
recovery_target_name | recovery_target_time | recovery_target_xid -- the recovery target; recovery_target_name refers to a point created with pg_create_restore_point().
recovery_target_inclusive -- for time and xid targets, whether to recover up to and including the target, or stop just before it.
recovery_target_timeline -- which timeline to recover along, or latest for the highest timeline.
pause_at_recovery_target -- pause once the target is reached. Typically you connect to the database to check whether this really is the point you wanted; if not, shut down, adjust the recovery target, and continue, and once satisfied call pg_xlog_replay_resume() to stop recovery and activate the database. Requires hot_standby = on.
Restore and recovery of a single tablespace / database
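A minimal PITR recovery.conf sketch; the paths and target time are illustrative:

restore_command = 'cp /backup/arch/%f %p'
recovery_target_time = '2012-06-01 12:00:00'
recovery_target_inclusive = true
recovery_target_timeline = 'latest'
pause_at_recovery_target = true        # requires hot_standby = on to inspect before activating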

364 Backup and Restore Example
[Diagram: continuous WAL archiving across timelines. If base backup 1 is used to recover to the marked point, set recovery_target_timeline = 2 or latest; activating after recovery changes the timeline. Marked on the diagram: pg_create_restore_point('test1') and pg_create_restore_point('test2') on timeline 1, the switch to timeline 2, and base backup 1 with its continuously archived WAL.]

365 HA and Replication Traditional HA: shared storage
Drawback: the storage, the paths to it, and even the HBAs are all failure points, so this "HA" is not truly highly available. Storage-level replication or file-system mirroring to another host is usually needed to cover storage failures.
PostgreSQL streaming replication brings new options:
LVS + PG streaming replication can form a read-load-balanced setup.
PG streaming replication also suits off-site disaster recovery.
PG streaming replication can be one part of an HA stack: with VIP failover and activation of a synchronous standby, you get very high data reliability and availability.
PgCloud: an approach that depends on neither virtualization nor centralized storage.

366 HA and Replication How asynchronous streaming replication works
[Diagram: (1) on the primary, backends write through the WAL buffer and the WAL writer to XLOG, the archiver copies segments to archive files, and the postmaster forks a wal sender; (2) on the standby, the postmaster forks a wal receiver (WalRcvData->conninfo, WalRcvData->receiveStart, WalRcvData->receivedUpto) which writes received WAL to xlog/archlog; (3) the startup process reads and replays it into the data files.]

367 Parameter Tuning:
Primary:
max_wal_senders
wal_sender_delay (the sleep is interrupted by transaction commit)
wal_keep_segments
vacuum_defer_cleanup_age (the number of transactions by which VACUUM and HOT updates will defer cleanup of dead row versions)
Standby:
hot_standby -- WAL apply vs. SQL-on-standby conflicts: see the parameters below
max_standby_archive_delay (the maximum total time allowed to apply any one WAL segment's data)
max_standby_streaming_delay (the maximum total time allowed to apply WAL data once it has been received from the primary server)
wal_receiver_status_interval (minimum frequency; the standby will report the last transaction log position it has written, the last position it has flushed to disk, and the last position it has applied)
hot_standby_feedback (send feedback to the primary about queries currently executing on the standby)

368 HA and Replication How synchronous streaming replication works
1. Many standbys. 2. Only one sync standby at the same time. 3. Sync scope: cluster, database, user, session, transaction.
[Diagram: as in the asynchronous case, but the wal receiver feeds its replication progress (WAL flush position) back to the wal sender, and committing backends wait until the sync standby confirms before being released.]

369 HA and Replication Parameter Tuning:
Primary:
max_wal_senders
wal_sender_delay
wal_keep_segments
vacuum_defer_cleanup_age
synchronous_replication
synchronous_standby_names (matched against the application_name set in the standby's primary_conninfo)
Standby:
hot_standby
max_standby_archive_delay
max_standby_streaming_delay
wal_receiver_status_interval
hot_standby_feedback

370 HA and Replication Streaming replication hot_standby walkthrough
Plan hosts, network, and storage; synchronize the clocks of primary and standby.
Create the primary database; configure its postgresql.conf and pg_hba.conf; create a replication role.
Prepare the hot standby: .pgpass, data directory.
Create the standby's base backup with pg_basebackup.
Configure the standby's recovery.conf and postgresql.conf (see the sketch below), then start the hot standby.
Test: create a user, tablespace, database, schema, and tables; load-test with pgbench; test role switchover. (PostgreSQL 9.2 adds cascading streaming replication.)
create table test (id int primary key,crt_time timestamp(0));
create index idx_test_crt_time on test (crt_time);
Insert continuously, and gauge the lag on the hot standby with a query such as select max(now()),max(crt_time) from test; or through pg_stat_replication on the primary, or with:
pg_last_xlog_receive_location
pg_last_xlog_replay_location
pg_last_xact_replay_timestamp
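A minimal standby recovery.conf sketch for the walkthrough; the host, port, credentials, and trigger path are illustrative:

standby_mode = 'on'
primary_conninfo = 'host=192.168.0.1 port=1921 user=repl password=secret application_name=standby01'
trigger_file = '/pgdata/digoal/1921/data02/pg_root/postgresql.trigger'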

371 HA and Replication Database replication technologies
Trigger-based replication: Slony-I, Bucardo, Londiste
SQL-distribution-based replication: pgpool, continue
Page-change-log-based replication: streaming replication, archive (log-shipping) replication

372 Scale-Out PG-XC pl/proxy pgpool-II

373 Monitoring Database Activity
long-running SQL
locks
unused indexes
dead tuple ratio
server load
server rtps / wtps
server iowait
server swap page in/out
server processes/s
error | warning logs
pg_stat_statements: CPU used by one SQL
Use PostgreSQL collect and analyze Operation System statistics

374 Procedure Language Multiple languages are supported: perl, python, tcl, plpgsql, etc.
A plpgsql function executes as a single transaction in PostgreSQL; when an exception is raised, the contents of the EXCEPTION block run as a separate (sub)transaction, as sketched below.
Debug plpgsql Function
PostgreSQL 2-PC Transaction
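A minimal sketch of that subtransaction behavior, assuming a table test (id int primary key, info text) as in the earlier isolation examples:

CREATE OR REPLACE FUNCTION f_exception_demo() RETURNS void AS $$
BEGIN
    INSERT INTO test VALUES (9999, 'kept');        -- survives: the outer block commits
    BEGIN
        INSERT INTO test VALUES (9999, 'dup');     -- fails on the primary key
    EXCEPTION WHEN unique_violation THEN
        RAISE NOTICE 'duplicate rolled back, function continues';
    END;                                           -- only the inner subtransaction is rolled back
END;
$$ LANGUAGE plpgsql;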

375 Additional Supplied Modules
Where to find modules; commonly used ones:
auto_explain -- automatically logs the execution plans of SQL exceeding a configured runtime
dblink -- database links, usable for connecting to remote databases
file_fdw -- create foreign tables on top of files
pageinspect -- inspect pages and items of tables and indexes
pg_archivecleanup -- clean up archived WAL

376 Additional Supplied Modules
Commonly used modules:
pgbench -- load-testing tool (see the usage sketch below)
pg_buffercache -- inspect shared buffer contents
pg_freespacemap -- inspect free space map information
pgrowlocks -- inspect row locks
pg_stat_statements -- track execution counts and CPU cost of SQL statements
pgstattuple -- tuple-level statistics
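A minimal pgbench run; the database name and sizes are illustrative:

$ pgbench -i -s 10 testdb            # initialize with scale factor 10
$ pgbench -c 8 -j 2 -T 60 testdb     # 8 clients, 2 worker threads, run for 60 seconds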

377 Additional Supplied Modules
Commonly used modules:
pg_test_fsync -- benchmarks the various OS synchronous-write interfaces
pg_trgm -- trigram-based similarity matching
pg_upgrade -- catalog-level in-place upgrades, usually many times faster than upgrading via pg_dump
Foreign data wrappers -- modules for creating foreign tables (see the fdw section of Day 1)
pgfincore -- emulate persistent caching of relations
Other common software around PostgreSQL: Sphinx for full-text search, PostGIS for geographic data

378 Tuning case A comprehensive PostgreSQL performance tuning case study

379 PostgreSQL 9.1.3 2Day DBA QuickGuide
Thanks PostgreSQL Day DBA QuickGuide
For questions about this deck, mail the author. Stay in touch -- personal QQ: , group:
[References]
《PostgreSQL Manual》
《PostgreSQL 9 Administration Cookbook》
《PostgreSQL 9.0 High Performance》
[For more, follow the author's blog]

380 China 2012 PostgreSQL User Conference
Location: Renmin University, Beijing. Dates: June 14-17.
Overview: the conference will bring community core team member Magnus Hagander, major developer Simon Riggs, PG-XC chief architect Koichi Suzuki, and other overseas experts to share the most current PostgreSQL developments, alongside talks from seasoned practitioners in the field.
PG-XC summit; database application track; kernel development track; administration and performance tuning track; PostgreSQL ACE awards.
Topics: call for papers open

