The function call overhead is expensive!
This makes count(*) on the census data 175ms instead
of 225ms, while not significantly impacting other use cases.
We can avoid eagerly computing bitmasks for other constraints this way.
Possible future work - order the constraints such that we evaluate the
one that is cheapest/most likely to prune a row group first.
This reduces the cyclist query from ~65ms to ~60ms
This reverts commit cbde3c73b6.
This regresses:
```
WITH inputs AS (
SELECT
geo_name,
CASE WHEN profile_id = 1930 THEN 'total' ELSE 'cyclist' END AS mode,
female,
male
FROM census
WHERE profile_id IN ( '1930', '1935') AND
csd_type_name = 'CY' AND
geo_name IN ('Victoria', 'Dawson Creek', 'Kitchener')
)
SELECT
total.geo_name,
cyclist.male,
cyclist.female,
100.0 * cyclist.male / total.male,
100.0 * cyclist.female / total.female
FROM inputs AS total
JOIN inputs AS cyclist USING (geo_name)
WHERE total.mode = 'total' AND cyclist.mode = 'cyclist';
```
while improving:
```
select count(*) from census where geo_name in ('Dawson Creek', 'Kitchener', 'Victoria') and csd_type_name = 'CY' and profile_id = '1930';
```
which seems like a bad tradeoff.
Create a shadow table. For `stats`, it'd be `_stats_rowgroups`.
It contains three columns:
- the clause (eg `city = 'Dawson Creek'`)
- the initial estimate, as a bitmap of rowgroups based on stats
- the actual observed rowgroups, as a bitmap
This papers over poorly sorted parquet files, at the cost of some disk
space. It makes interactive queries much more natural -- drilldown style
queries are much faster, as they can leverage work done by previous
queries.
eg 'SELECT * FROM stats WHERE city = 'Dawson Creek' and question_id >= 1935 and question_id <= 1940`
takes ~584ms on first run, but 9ms on subsequent runs.
We only create entries when the estimates don't match the actual
results.
Fixes#6
This drops the `= 'Dawson Creek'` query from 210ms to 145ms.
Maybe inlining would have been an option here? I'm not familiar enough
with g++ to know. :(
Now the `== 'Dawson Creek'` query is ~210ms, which is approx the
same as a `count(*)` query. This seems maybe OK, since the row group
filter is only excluding 30% of records.
This gets the census `== 'Dawson Creek'` query down to ~410ms from
~650ms.
That still seems much slower than it should be. Am I accidentally
doing a copy? Now to go learn how to profile C++ code...
For the statscan census set filtering on `== 'Dawson Creek'`, the query
goes from 980ms to 660ms.
This is expected, since the data isn't sorted by that column.
I'll try adding some scaffolding to do filtering at the row level, too.
We could also try unpacking the dictionary and testing the individual
values, although we may want some heuristics to decide whether it's
worth doing -- eg if < 10% of the rows have a unique value.
Ideally, this should be like a ~1ms query.
- "fixed_size_binary" -> "binary_10"
- make null parquet use rowgroups of sie 10: first rowgroup
has no nulls, 2nd has all null, 3rd-10th have alternating
nulls
This is prep for making a Postgres layer to use as an oracle
for generating test cases so that we have good coverage before
implementing advanced `xBestIndex` and `xFilter` modes.