Advanced Query Construction using HTSQL

HTSQL is a high-level navigational query language and analytical toolkit for relational databases. It is written by Kirill Simonov and Clark Evans in the Python language. HTSQL works with SQLite, PostgreSQL, MySQL, Oracle, and Microsoft SQL Server. HTSQL was created to support data analysts.

Presenter:Clark C. Evans, Prometheus Research, LLC
Date:Sept 16, 2011
Location:PgOpen US 2011, Chicago, IL

What is HTSQL?

img/what_is_it.png

HTSQL is a query language for answering business inquiries, quickly & correctly, the first time through.

More Technically

img/service.png

Interactive Dashboard in Minutes

img/htraf_source_new.png img/htraf_screenshot.png

Dashboard: Screenshot

img/htraf_screenshot.png

Dashboard: Source Code

img/htraf_source_new.png

Why HTSQL?

img/confessional.jpg

We wrote HTSQL to enable Data Analysts to self-serve both simple and complex data questions without having to spend time in a confessional booth for SQL mishaps.

Technically, something like an ORM, but sharable via the web and focused on complex reporting needs rather than CRUD operations.

Socially, a mechanism where business analysts could give working queries to the development staff, rather than the other way around.

In Need of a Shared Language

img/triangle.png

Collaborative Work Groups

img/sharing.jpg

The Accidental Programmer

SEQUEL was introduced in 1974 as "a query facility oriented toward users who are not computer specialists."

"[T]here is also a large class of users who, while they are not computer specialists, would be willing to learn to interact with a computer in a reasonably high-level, non-procedural query language. Examples of such users are accountants, engineers, architects, and urban planners. It is for this class of users that SEQUEL is intended."

D.Chamberlin, R.F. Boyce, SEQUEL: A Structured English Query Language. Proc. ACM SIGFIDET Conference, Ann Arbor, MI, May 1974 (Page 258)

Relational Algebra is a Poor Fit

img/poorfit.jpg

What is Navigational Model?

Let's use entity relationships as navigation, and build query language around this principle.

Theory follows practice.

Example University Schema

img/schema.png

Anatomy of a Query

img/3parts.png

SQL conflates Rows & Columns

"Please list departments; for each department,
show the corresponding school's campus."
SELECT d.code, d.name, s.campus
FROM ad.department AS d
LEFT JOIN ad.school AS s
       ON (d.school_code = s.code);

The business inquiry asks for a specific set of rows, and then correlated columns. The SQL encoding returns a subset of a cross product, making it difficult to ensure what each row represents. The FROM clause doesn't just pick rows, it also plays and an auxiliary role in choosing columns.

Keep Row Definition Separate

"Please list departments; for each department,
the show the corresponding school's campus."
/department{code, name, school.campus}

The navigational translation separates the row definition from the column selection. The linking is implicit, and correct. The encoded query can be read aloud as a verbal inquiry.

Optimized SQL is Less Clear

"For each department, return the department's
name and number of courses having more than 3 credit hours."
SELECT d.name, count(c)
FROM ad.department AS d
LEFT JOIN ad.course AS c
       ON (c.department_code = d.code
           AND c.credits > 3)
GROUP BY d.name;

To optimize, the subquery is replaced by a GROUP BY projection. This gives us both row/column and link/filter conflation, obfuscating the business inquiry.

Conflating Projection & Aggregation

"How many departments by campus?"

SELECT s.campus, COUNT(d)
FROM ad.school AS s
LEFT JOIN ad.department AS d
  ON (s.code = d.school_code)
WHERE s.campus IS NOT NULL
GROUP by s.campus;

In the schema there isn't a campus table, you have to take distinct values from the school table. In this SQL query its not clear if the GROUP BY is used only to produce an aggregate, you have to examine primary key columns to know for sure.

Keep Projection Separate

"How many departments by campus?"

/(school^campus) {campus, count(school.department)}

In a navigational approach, you first construct the projection explicitly (using ^ operator). Then, you select from it. In this way the aggregation is indicated separately as part of the column selector rather than being confused with the row definition.

For SQL, Complexity is Painful

"For each school with a degree program, return
the school's name, and the average number of high-credit (>3) courses its departments have."
SELECT s.name, o.avg_over_3 FROM ad.school AS s
JOIN ad.program AS p ON (p.school_code = s.code)
LEFT JOIN (
    SELECT d.school_code, AVG(COALESCE(i.over_3,0)) AS avg_over_3
    FROM ad.department d
    LEFT JOIN (
        SELECT c.department_code, COUNT(c) AS over_3
         FROM ad.course AS c WHERE c.credits > 3
         GROUP BY c.department_code
    ) AS i ON (i.department_code = d.code)
    GROUP BY d.school_code
) AS o ON (o.school_code = s.code)
GROUP BY s.name, o.avg_over_3;

Case Study: RexDB Constraints

img/rexdb.jpg img/rexdb_navigation.jpg

Case Study: RexDB Constraints

img/rexdb_constraint.jpg

Case Study: RexDB Constraints

Since rows & columns are clearly isolated, queries can be composed dynamically. The general template is:

/individual{id, check_1, check_2}?filter

For example, to return males participating in the 'aie' study, executing two completely orthogonal checks, we write:

/individual
 .define(srs_13_pair := !exists(srs ? q3>24 | q1>7),
         adi_sanity := ... )
 {id, adi_sanity, srs_13_pair}
 ? sex='m'&exists(participation.study = 'aie')

How do you use it?

There are several ways you can use HTSQL.

Via Command Line

HTSQL can be used via a command-line interpreter

$ htsql-ctl shell pgsql:htsql_regress
Interactive HTSQL Shell
Type 'help' for more information, 'exit' to quit the shell.

htsql_regress$ /count(student)
 | count(student) |
-+----------------+-
 |            456 |
            (1 row)

Direct use in Python

HTSQL can be used either as a WSGI component, or directly as a data query tool within Python.

>>> from htsql import HTSQL
>>> from htsql.request import produce
>>> htsql = HTSQL('pgsql:htsql_regress')
>>> with htsql:
...     for row in produce('/school'):
...         print row
...

As a Web Service

img/show_school.png

HTSQL can be included as part of your web service infrastructure: authenticated, cached, proxied, or how ever you wish to have it integrated.

Development Status

HTSQL is quite usable currently, but it may have gaps for a given application. Particular items we'll be addressing in coming months are:

How do I get it?

Open Community

Dual-License & Support

Q&A

Please visit our community site, http://htsql.org, our commercial site http://htsql.com, we are best found at #htsql on freenode. The source code is freely available at http://bitbucket.org/prometheus/htsql

Generous support for HTSQL was provided by The Simons Foundation. This material is also based upon work supported by the National Science Foundation under Grant #0944460.