Commit Graph

10 Commits

Author SHA1 Message Date
Colin Dellow cbde3c73b6 Don't eagerly evaluate constraints
...to avoid decompressing columns when we know from previous
columns that the row can't match.

Fixes #10
2018-06-23 20:31:03 -04:00
Colin Dellow d3ab5ff3e7 Cache clauses -> row group mapping
Create a shadow table. For `stats`, it'd be `_stats_rowgroups`.

It contains three columns:

- the clause (eg `city = 'Dawson Creek'`)
- the initial estimate, as a bitmap of rowgroups based on stats
- the actual observed rowgroups, as a bitmap

This papers over poorly sorted parquet files, at the cost of some disk
space. It makes interactive queries much more natural -- drilldown style
queries are much faster, as they can leverage work done by previous
queries.

eg 'SELECT * FROM stats WHERE city = 'Dawson Creek' and question_id >= 1935 and question_id <= 1940`
takes ~584ms on first run, but 9ms on subsequent runs.

We only create entries when the estimates don't match the actual
results.

Fixes #6
2018-03-24 23:57:15 -04:00
Colin Dellow a3af16eb54 Row-filtering for other string ops 2018-03-17 15:28:51 -04:00
Colin Dellow 1f4cebe2a6 Don't use accessors
This drops the `= 'Dawson Creek'` query from 210ms to 145ms.

Maybe inlining would have been an option here? I'm not familiar enough
with g++ to know. :(
2018-03-15 23:04:11 -04:00
Colin Dellow f7f1ed03d1 add row filter for string ==
This gets the census `== 'Dawson Creek'` query down to ~410ms from
~650ms.

That still seems much slower than it should be. Am I accidentally
doing a copy? Now to go learn how to profile C++ code...
2018-03-15 21:37:52 -04:00
Colin Dellow 6648ff5968 add string == row group filter
For the statscan census set filtering on `== 'Dawson Creek'`, the query
goes from 980ms to 660ms.

This is expected, since the data isn't sorted by that column.

I'll try adding some scaffolding to do filtering at the row level, too.

We could also try unpacking the dictionary and testing the individual
values, although we may want some heuristics to decide whether it's
worth doing -- eg if < 10% of the rows have a unique value.

Ideally, this should be like a ~1ms query.
2018-03-15 20:40:21 -04:00
Colin Dellow 769060dbcb Add stub row group filters for text/int/dbl
Checkpointing to investigate why min/max stats for text aren't
present
2018-03-12 23:07:41 -04:00
Colin Dellow 95748a5192 Remove bool from Constraint 2018-03-12 20:50:30 -04:00
Colin Dellow acc15256ec Add rowgroup filtering for rowid 2018-03-12 20:42:50 -04:00
Colin Dellow 830053c1fc Scaffolding for in-extension filtering
Supports IS NULL and IS NOT NULL checks
2018-03-11 13:58:10 -04:00