Very simplistic - select M fields, filter on N fields, slight bias toward
using values of the same type as the field they're compared against.
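Roughly the shape of the generator, as a sketch in Python -- the column
names, types and sample literals here are made up; in practice they'd be
pulled from the parquet schema and data:
```
import random

# Hypothetical catalog: column name -> (type, sample literals from the data).
COLUMNS = {
    "binary_9": ("text", ["'56'", "'Dawson Creek'"]),
    "ts_5":     ("int",  ["496886400000", "0"]),
    "int32_3":  ("int",  ["99", "-1"]),
}
OPS = ["=", "<>", "<", "<=", ">", ">="]

def random_query(table, max_select=3, max_filters=2, same_type_bias=0.8):
    # select M fields...
    selected = random.sample(list(COLUMNS), random.randint(1, max_select))
    # ...filter on N fields
    preds = []
    for col in random.sample(list(COLUMNS), random.randint(1, max_filters)):
        _, samples = COLUMNS[col]
        if random.random() < same_type_bias:
            # usually compare against a literal of the column's own type
            value = random.choice(samples)
        else:
            # occasionally grab a literal belonging to some other column
            other = random.choice([c for c in COLUMNS if c != col])
            value = random.choice(COLUMNS[other][1])
        preds.append(f"{col} {random.choice(OPS)} {value}")
    return f"select {', '.join(selected)} from {table} where {' and '.join(preds)};"

print(random_query("nulls1"))
```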
No segfaults yet, but there's one test case that generates differing output
when run against `nulls` and `nulls1`:
```
select rowid from nulls1 where binary_9 >= '56' and ts_5 < 496886400000;
```
Regularize the parquets - `nulls` and `no_nulls` each come in 3 variants,
with 1, 10 and 99 rows per row group.
All test queries are written against `nullsA` and `no_nullsA`.
The next commit will introduce a tool to expand these template queries to
run against the actual tables.
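The expansion tool will probably look something like this (a sketch; it
assumes the variants are literally named `nulls1`/`nulls10`/`nulls99` and
`no_nulls1`/`no_nulls10`/`no_nulls99`):
```
import re

VARIANTS = ["1", "10", "99"]  # rows per row group

def expand(template_query):
    """Turn one query written against nullsA/no_nullsA into one per variant."""
    queries = []
    for v in VARIANTS:
        q = re.sub(r"\bno_nullsA\b", f"no_nulls{v}", template_query)
        q = re.sub(r"\bnullsA\b", f"nulls{v}", q)
        queries.append(q)
    return queries

for q in expand("select rowid from nullsA where binary_9 >= '56';"):
    print(q)
```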
This drops the `= 'Dawson Creek'` query from 210ms to 145ms.
Maybe inlining would have been an option here? I'm not familiar enough
with g++ to know. :(
Now the `== 'Dawson Creek'` query is ~210ms, which is approximately the
same as a `count(*)` query. This seems plausible, since the row group
filter is only excluding 30% of the records.
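For context, the row group filter boils down to a min/max check against
each group's statistics. A pyarrow sketch of the idea (the file and column
names are placeholders; the real implementation lives in the extension's
C++):
```
import pyarrow.parquet as pq

def candidate_row_groups(path, column, value):
    """Row groups whose stats say they *might* contain `value` for an
    equality predicate; everything else is skipped without being decoded."""
    md = pq.ParquetFile(path).metadata
    keep = []
    for rg in range(md.num_row_groups):
        group = md.row_group(rg)
        for c in range(group.num_columns):
            col = group.column(c)
            if col.path_in_schema != column:
                continue
            stats = col.statistics
            if stats is None or not stats.has_min_max:
                keep.append(rg)  # no stats, so we can't rule the group out
            elif stats.min <= value <= stats.max:
                keep.append(rg)  # value falls inside [min, max]
            break
    return keep

# e.g. which groups could possibly hold 'Dawson Creek'?
# print(candidate_row_groups("census.parquet", "city", "Dawson Creek"))
```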
This gets the census `== 'Dawson Creek'` query down to ~410ms from
~650ms.
That still seems much slower than it should be. Am I accidentally
doing a copy? Now to go learn how to profile C++ code...
For the statscan census set, filtering on `== 'Dawson Creek'` goes from
980ms to 660ms.
The modest gain is expected, since the data isn't sorted by that column,
so few row groups can be excluded outright.
I'll try adding some scaffolding to do filtering at the row level, too.
We could also try unpacking the dictionary and testing the individual
values, although we may want a heuristic to decide whether it's
worth doing -- e.g. only when the number of distinct values is under
10% of the row count.
Ideally, this should be like a ~1ms query.
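A sketch of the dictionary idea, again in pyarrow terms (the 10% cutoff
and names are placeholders, and a real implementation would read the
dictionary page directly instead of decoding the whole column chunk):
```
import pyarrow.parquet as pq

def row_group_can_match(path, rg, column, value, max_distinct_ratio=0.10):
    """Skip row group `rg` if none of its distinct values equals `value`,
    but only bother when the dictionary is small relative to the row count."""
    pf = pq.ParquetFile(path)
    tbl = pf.read_row_group(rg, columns=[column])
    distinct = tbl.column(column).unique()
    if len(distinct) > max_distinct_ratio * tbl.num_rows:
        return True  # too many distinct values: not worth it, assume a match
    return value in distinct.to_pylist()
```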
- define `datetime`, `printf` fns in pg so it produces output similar to
sqlite's
- tidy up the input data to be less wide
To do: some fns to make it easy to generate a new test case. Probably
want to mount all 3 parquets simultaneously and refer to the sqlite
table by the same name as the pg table.
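Something like this for the mounting part (a sketch that assumes the
extension builds as `libparquet` and exposes the `USING parquet('file')`
syntax; paths are placeholders):
```
import sqlite3

def mount_nulls_variants(db=":memory:", ext="./libparquet"):
    """Attach the three nulls variants to one sqlite connection, using the
    same table names the pg side uses; no_nulls would work the same way."""
    conn = sqlite3.connect(db)
    conn.enable_load_extension(True)
    conn.load_extension(ext)
    for name in ("nulls1", "nulls10", "nulls99"):
        conn.execute(
            f"CREATE VIRTUAL TABLE {name} USING parquet('./{name}.parquet')")
    return conn
```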