Open source PostgreSQL marks a turning point with performance, analytics advances
Today, organizations are storing, managing, and analyzing ever-increasing volumes of data, and PostgreSQL is evolving to meet these extreme processing demands.
The newest version of the open source database, PostgreSQL 9.5, released by the PostgreSQL Community on January 7, 2016, marked a turning point for using PostgreSQL with data-driven, high-speed, high-volume applications.
New PostgreSQL features include:
- Analytics and productivity enhancements to speed complex query capabilities on extreme data volumes, planting PostgreSQL more squarely in the data warehousing space
- Performance boosters for more powerful high-end servers with high core counts and large memory footprints that companies are using today for high-speed applications
- A foundation for horizontal scalability across multiple servers that today helps DBAs access data from external databases like MongoDB and Hadoop
Data Analytics and Productivity Enhancements
PostgreSQL 9.5 delivers a series of new features to address the needs of analytics teams. Many prior advancements targeted database performance or DBA manageability and productivity. As PostgreSQL has spread across the enterprise, however, the development community has increasingly responded to feature requests from other groups, and one group answered in PostgreSQL 9.5 was analytics experts. While the database may not yet be ready to serve as a prime-time data warehouse at very large scale (for example, petabyte-sized warehouses), the PostgreSQL Community added new features that support deeper analysis and make it easier for DBAs to prepare data for study.
The key features in PostgreSQL 9.5 for data analytics are BRIN indexes, grouping sets, CUBE, and ROLLUP:
BRIN, the headline big data feature for the release, addresses larger data volumes. BRIN (Block Range Indexes) solves the problem for DBAs who run regular queries over vast amounts of data in individual tables. For example, data analysis teams may be querying huge tables containing records detailing website activity or sales data.
Previously, PostgreSQL DBAs would typically create BTREE indexes over such tables so that queries could avoid scanning the entire table. Because those indexes contain a pointer to every occurrence of each value in the indexed columns, they can be very large themselves; whilst using such an index is far more efficient than scanning the table directly, its size can still mean a significant amount of I/O to find the desired data.
BRIN indexes solve this problem by indexing ranges of values rather than each individual value. Data types with a linear sort order might index minimum and maximum values within a block range on the underlying table, whilst geometric types might store the bounding box for all the objects in the block range. This results in a less precise, but significantly smaller index. Instead of using a huge index to find precise locations of rows, we use a very small index to find the approximate locations, and then scan those locations directly to find the matching rows within them.
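As a minimal sketch of the idea (the table and column names here are hypothetical), a BRIN index on a large, naturally ordered table might look like:

```sql
-- Hypothetical table of website activity, loaded in rough time order
CREATE TABLE website_activity (
    event_time timestamptz NOT NULL,
    user_id    integer,
    url        text
);

-- The BRIN index stores only summary values (e.g. min/max) per block
-- range, so it is far smaller than an equivalent BTREE index
CREATE INDEX website_activity_time_brin
    ON website_activity USING brin (event_time);

-- A range query can use the index to skip block ranges whose
-- min/max values cannot contain matching rows
SELECT count(*)
FROM website_activity
WHERE event_time >= '2016-01-01'
  AND event_time <  '2016-02-01';
```

BRIN works best when the physical order of rows correlates with the indexed column, as it typically does for append-only, timestamped data.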
Grouping sets enable DBAs to run multiple ‘GROUP BYs’ in a single query. Instead of running several individual queries, the DBA can pose one more complex question. In a large company, for example, a DBA can write a single granular, multi-level query that groups employees by division, then by department, then by shift, then by workgroup. Previously this required individual queries, a time-consuming process for very large organizations.
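The multi-level employee query described above might be sketched as follows (table and column names are hypothetical):

```sql
-- One query returns headcounts at several levels of granularity
SELECT division, department, shift, count(*) AS headcount
FROM employees
GROUP BY GROUPING SETS (
    (division),                      -- per division
    (division, department),          -- per department within division
    (division, department, shift),   -- per shift within department
    ()                               -- grand total
);
```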
CUBE, new in PostgreSQL analytics, is a keyword that calculates aggregates for every possible combination of the specified grouping columns. A DBA can, for example, identify a sales region, set up a CUBE, and get all of the possible summaries from those tables, such as total revenue, time of sales, products sold, by whom, and to whom. It is a shortcut that avoids querying the entire database for records pertaining just to that region, or running many individual open-ended queries, both of which are time-consuming processes.
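For example (with a hypothetical `sales` table), a CUBE over three columns produces aggregates for all eight combinations of those columns:

```sql
SELECT region, product, salesperson, sum(revenue) AS total_revenue
FROM sales
GROUP BY CUBE (region, product, salesperson);
-- Equivalent to GROUPING SETS over every subset of
-- (region, product, salesperson), including the empty set
```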
Rollups are a very simple concept. They let DBAs treat a query’s grouping columns as a hierarchy, so what comes back is a series of subtotals plus a grand total, for example. The database adds up data across multiple dimensions at once, treating some columns as superior to others in the hierarchy. For financial reports this is extremely important, and it replaces a more manual process.
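A financial-report rollup might be sketched like this (hypothetical table and columns), yielding per-quarter subtotals, per-year subtotals, and a grand total:

```sql
SELECT year, quarter, sum(revenue) AS revenue
FROM financials
GROUP BY ROLLUP (year, quarter);
-- ROLLUP (year, quarter) is shorthand for GROUPING SETS
-- ((year, quarter), (year), ())
```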
EnterpriseDB (EDB) co-led the development of the performance enhancements in PostgreSQL 9.5 with members of the open source community. Working with large enterprise customers and partnering with hardware vendors, EDB saw first-hand how new high-performance servers with larger memory footprints affected PostgreSQL performance across a variety of enterprise environments. EDB and members of the open source community optimized PostgreSQL’s memory access, concurrency, and locking mechanisms to speed Postgres performance in these high-performance environments.
Performance and scalability advances in PostgreSQL 9.5 include:
- Shared buffer concurrency: An optimized locking regime in shared buffers lets PostgreSQL 9.5 support greater data volumes while increasing performance.
- Expanded concurrent locking: With a reduced number of page locks and pins holding indexes during scans, PostgreSQL 9.5 can support more concurrent users in high-transactional systems, which also increases performance.
- Buffer mapping: With enhanced shared buffer management, PostgreSQL 9.5 can support a greater number of shared buffer partitions.
- Prefix sorting: Using an abbreviated sorting optimization, PostgreSQL 9.5 can sort large text and numeric fields 20% to 30% faster than PostgreSQL 9.4, according to benchmark tests.
EDB has been working for the past eight years with Japan’s Nippon Telegraph and Telephone (NTT) – a Fortune 500 Japanese telecom company that contributes to the open source PostgreSQL project – on Postgres-XC, a prototype of an external sharding environment. Postgres-XC breaks PostgreSQL databases into pieces and spreads them across multiple servers, using a transaction manager that tracks the location of the data and keeps transactions atomic. Maintaining consistency across multiple machines has long been a challenge for relational databases such as PostgreSQL, because records are interrelated. Building Postgres-XC required NTT to rewrite some of the core PostgreSQL code to eliminate the requirement that all data be controlled directly by the database.
In mid-2014, NTT and EDB started examining how to use Foreign Data Wrappers (FDWs) to achieve this inside Postgres, because FDWs can connect other databases to PostgreSQL. FDWs enable PostgreSQL to read and write data from other databases and expose it as PostgreSQL tables, which developers can query with SQL as if it were native PostgreSQL data.
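As an illustration using postgres_fdw, the wrapper for remote PostgreSQL servers that ships with the core distribution (host names, credentials, and table names here are placeholders):

```sql
CREATE EXTENSION postgres_fdw;

-- Describe the remote server and how to log in to it
CREATE SERVER remote_srv
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'remote.example.com', dbname 'salesdb');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER remote_srv
    OPTIONS (user 'reporting', password 'secret');

-- A foreign table is queried like any local table
CREATE FOREIGN TABLE remote_orders (
    order_id integer,
    total    numeric
) SERVER remote_srv
  OPTIONS (schema_name 'public', table_name 'orders');

SELECT sum(total) FROM remote_orders;
```

Wrappers for other systems, such as MongoDB and Hadoop, are available as separately installed extensions and follow the same pattern.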
A new FDW feature in PostgreSQL 9.5, join pushdown, provides the infrastructure that allows FDW authors to push join operations down to the remote server, further distributing the workload required to process queries that include data from multiple tables on one or more remote servers. Another enhancement, IMPORT FOREIGN SCHEMA, automates importing of schemas from external databases connected to PostgreSQL through FDWs. This replaces a manual process when accessing data from outside sources and reduces the potential for errors.
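Assuming a foreign server (here called remote_srv) has already been defined through an FDW such as postgres_fdw, IMPORT FOREIGN SCHEMA pulls in all of the remote table definitions in one statement rather than one hand-written CREATE FOREIGN TABLE per table:

```sql
CREATE SCHEMA remote_sales;

-- Import every table in the remote 'public' schema as foreign
-- tables in the local 'remote_sales' schema
IMPORT FOREIGN SCHEMA public
    FROM SERVER remote_srv
    INTO remote_sales;
```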
The open source community is already looking ahead to PostgreSQL 9.6, and EDB and others are continuing work on numerous features and enhancements. We expect further performance improvements (including parallelism), along with the features required to realize our vision of FDW-based transparent data sharding, to be the focus of many of EDB’s developers. Today, however, the new Postgres has proven its ability to continually expand with the demands of users and advancing data center environments.