Create a shadow table. For `stats`, it'd be `_stats_rowgroups`.
It contains three columns:
- the clause (eg `city = 'Dawson Creek'`)
- the initial estimate, as a bitmap of rowgroups based on stats
- the actual observed rowgroups, as a bitmap
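A minimal sketch of what creating that shadow table might look like (the column names and types are assumptions for illustration, not necessarily the extension's real schema):

```
// Sketch only: creates the shadow table for a given virtual table name,
// eg _stats_rowgroups for `stats`. Column names here are assumed.
#include <sqlite3.h>
#include <string>

int createShadowTable(sqlite3* db, const std::string& table) {
  std::string sql =
      "CREATE TABLE IF NOT EXISTS _" + table + "_rowgroups ("
      "clause TEXT PRIMARY KEY, "   // eg "city = 'Dawson Creek'"
      "estimate BLOB, "             // rowgroup bitmap implied by parquet stats
      "actual BLOB)";               // rowgroup bitmap actually observed
  return sqlite3_exec(db, sql.c_str(), nullptr, nullptr, nullptr);
}
```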
This papers over poorly sorted parquet files, at the cost of some disk
space. It makes interactive queries much more natural -- drilldown style
queries are much faster, as they can leverage work done by previous
queries.
eg `SELECT * FROM stats WHERE city = 'Dawson Creek' and question_id >= 1935 and question_id <= 1940`
takes ~584ms on first run, but 9ms on subsequent runs.
We only create entries when the estimates don't match the actual
results.
Fixes #6
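A minimal sketch of the write-back path under that rule, reusing the assumed `_stats_rowgroups` columns from the sketch above:

```
// Sketch only: persist a clause's observed rowgroup bitmap, but only when the
// scan proved the stats-based estimate wrong. Schema/column names are assumed.
#include <sqlite3.h>
#include <cstdint>
#include <string>
#include <vector>

void maybeCacheClause(sqlite3* db, const std::string& clause,
                      const std::vector<uint8_t>& estimate,
                      const std::vector<uint8_t>& actual) {
  if (actual == estimate)
    return;  // estimate was already exact; caching would only cost disk space

  sqlite3_stmt* stmt = nullptr;
  sqlite3_prepare_v2(db,
      "INSERT OR REPLACE INTO _stats_rowgroups (clause, estimate, actual) "
      "VALUES (?, ?, ?)",
      -1, &stmt, nullptr);
  sqlite3_bind_text(stmt, 1, clause.c_str(), -1, SQLITE_TRANSIENT);
  sqlite3_bind_blob(stmt, 2, estimate.data(), (int)estimate.size(), SQLITE_TRANSIENT);
  sqlite3_bind_blob(stmt, 3, actual.data(), (int)actual.size(), SQLITE_TRANSIENT);
  sqlite3_step(stmt);
  sqlite3_finalize(stmt);
}
```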
Very simplistic - selects M fields, filters on N fields, with a slight bias
toward using values of the same type as the field it's comparing against.
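For illustration, a rough sketch of such a generator (the column metadata, operator choice, and bias constant are all invented here):

```
// Sketch only: emits a random query in the spirit described above.
#include <cstdlib>
#include <string>
#include <vector>

struct Column { std::string name; bool isText; };

std::string randomQuery(const std::vector<Column>& cols, const std::string& table) {
  // Select a random field (the real generator picks M of them).
  std::string sql = "SELECT " + cols[rand() % cols.size()].name + " FROM " + table;

  // Filter on N random fields.
  int n = 1 + rand() % 3;
  for (int i = 0; i < n; i++) {
    const Column& c = cols[rand() % cols.size()];
    // Slight bias toward a literal of the same type as the column, so most
    // comparisons are sensible but some still exercise type mismatches.
    bool matchType = rand() % 4 != 0;
    bool textLiteral = matchType ? c.isText : !c.isText;
    std::string literal = textLiteral ? "'56'" : "496886400000";
    sql += std::string(i == 0 ? " WHERE " : " AND ") + c.name + " >= " + literal;
  }
  return sql;
}
```

eg `randomQuery({{"binary_9", true}, {"ts_5", false}}, "nulls1")` could produce something like the failing query below.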
No segfaults yet, but there's one test case that generates differing output when
run against `nulls` and `nulls1`:
```
select rowid from nulls1 where binary_9 >= '56' and ts_5 < 496886400000;
```
Regularize the parquets - `nulls` and `no_nulls` each come in 3 variants,
with 1, 10 and 99 rows per rowgroup.
All test queries are written against `nullsA`, `no_nullsA`.
Next commit will introduce a tool to expand these template queries to
go against the actual tables.
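A minimal sketch of what that expansion might look like (the actual tool may end up quite different):

```
// Sketch only: rewrite a template query (written against eg nullsA) to target
// a concrete variant such as nulls1, nulls10 or nulls99.
#include <string>

std::string expandTemplate(std::string sql, const std::string& placeholder,
                           const std::string& table) {
  for (size_t pos = sql.find(placeholder); pos != std::string::npos;
       pos = sql.find(placeholder, pos + table.size()))
    sql.replace(pos, placeholder.size(), table);
  return sql;
}

// expandTemplate("select rowid from nullsA where ...", "nullsA", "nulls99")
```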
This drops the `= 'Dawson Creek'` query from 210ms to 145ms.
Maybe inlining would have been an option here? I'm not familiar enough
with g++ to know. :(
Now the `== 'Dawson Creek'` query is ~210ms, which is approx the
same as a `count(*)` query. This seems maybe OK, since the row group
filter is only excluding 30% of records.
This gets the census `== 'Dawson Creek'` query down to ~410ms from
~650ms.
That still seems much slower than it should be. Am I accidentally
doing a copy? Now to go learn how to profile C++ code...
For the statscan census set filtering on `== 'Dawson Creek'`, the query
goes from 980ms to 660ms.
This is expected, since the data isn't sorted by that column.
I'll try adding some scaffolding to do filtering at the row level, too.
We could also try unpacking the dictionary and testing the individual
values, although we may want some heuristics to decide whether it's
worth doing -- eg if < 10% of the rows have a unique value.
Ideally, this should be like a ~1ms query.
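A rough sketch of that dictionary check, assuming the dictionary values for a column chunk have already been decoded (fetching them from the parquet reader is out of scope here):

```
// Sketch only: decide whether a row group can possibly contain `target`
// by scanning its (already decoded) dictionary values.
#include <cstdint>
#include <string>
#include <vector>

bool rowGroupCanMatch(const std::vector<std::string>& dict, int64_t rowCount,
                      const std::string& target) {
  // Heuristic: only worth it when distinct values are rare, eg when fewer
  // than 10% of the rows have a unique value.
  if ((int64_t)dict.size() * 10 >= rowCount)
    return true;  // too many distinct values; don't bother, keep the row group

  for (const std::string& v : dict)
    if (v == target)
      return true;  // value present somewhere in this row group

  return false;  // value absent from the dictionary: skip the whole row group
}
```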