`./make-linux` clones and builds:
- arrow
- brotli
- lz4
- parquet
- snappy
- zlib
- zstd
- this project
as a statically linked binary. Two Boost libraries are still pulled in as
shared libs; we should probably fix that, too, for ultimate portability.
Function call overhead is expensive! Avoiding it brings count(*) on the
census data down from 225ms to 175ms, while not significantly impacting
other use cases.
We can avoid eagerly computing bitmasks for other constraints this way.
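A minimal sketch of that lazy evaluation, with hypothetical names (`Constraint` and `computeRowGroupBitmap` are illustrative, not the project's actual API): each constraint's bitmap is computed only when the scan reaches it, and the loop bails out as soon as the running intersection is empty.
```cpp
#include <functional>
#include <vector>

// Illustrative only: a constraint that can compute, on demand, the set
// of row groups it might match.
struct Constraint {
  std::function<std::vector<bool>()> computeRowGroupBitmap;
};

std::vector<bool> rowGroupsToScan(const std::vector<Constraint>& constraints,
                                  size_t numRowGroups) {
  std::vector<bool> result(numRowGroups, true);
  for (const auto& constraint : constraints) {
    // Pay for this constraint's bitmap only now, not up front.
    std::vector<bool> mask = constraint.computeRowGroupBitmap();
    bool anySurvive = false;
    for (size_t i = 0; i < numRowGroups; i++) {
      result[i] = result[i] && mask[i];
      anySurvive = anySurvive || result[i];
    }
    // Short-circuit: once no row group survives, the remaining
    // (potentially expensive) bitmaps never need to be computed.
    if (!anySurvive) break;
  }
  return result;
}
```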
Possible future work: order the constraints such that we evaluate the
one that is cheapest/most likely to prune a row group first.
This reduces the cyclist query from ~65ms to ~60ms.
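One possible heuristic for that ordering, sketched below; `estimatedCost` and `selectivity` are assumed inputs, not quantities the project tracks today. The idea is to run the constraint with the most pruning power per unit of cost first, so the lazy evaluation above can short-circuit as early as possible.
```cpp
#include <algorithm>
#include <vector>

// Hypothetical ranking data for a constraint.
struct RankedConstraint {
  double estimatedCost;  // relative cost of computing its row-group bitmap
  double selectivity;    // expected fraction of row groups that survive (0..1)
};

void orderConstraints(std::vector<RankedConstraint>& constraints) {
  std::sort(constraints.begin(), constraints.end(),
            [](const RankedConstraint& a, const RankedConstraint& b) {
              // Pruning power per unit cost; higher scores run first.
              double scoreA = (1.0 - a.selectivity) / a.estimatedCost;
              double scoreB = (1.0 - b.selectivity) / b.estimatedCost;
              return scoreA > scoreB;
            });
}
```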
This reverts commit cbde3c73b6.
This regresses:
```
WITH inputs AS (
SELECT
geo_name,
CASE WHEN profile_id = 1930 THEN 'total' ELSE 'cyclist' END AS mode,
female,
male
FROM census
  WHERE profile_id IN ('1930', '1935') AND
csd_type_name = 'CY' AND
geo_name IN ('Victoria', 'Dawson Creek', 'Kitchener')
)
SELECT
total.geo_name,
cyclist.male,
cyclist.female,
100.0 * cyclist.male / total.male,
100.0 * cyclist.female / total.female
FROM inputs AS total
JOIN inputs AS cyclist USING (geo_name)
WHERE total.mode = 'total' AND cyclist.mode = 'cyclist';
```
while improving:
```
select count(*)
from census
where geo_name in ('Dawson Creek', 'Kitchener', 'Victoria') and
      csd_type_name = 'CY' and
      profile_id = '1930';
```
which seems like a bad tradeoff.
Create a shadow table. For `stats`, it'd be `_stats_rowgroups`.
It contains three columns:
- the clause (e.g. `city = 'Dawson Creek'`)
- the initial estimate, as a bitmap of rowgroups based on stats
- the actual observed rowgroups, as a bitmap
This papers over poorly sorted parquet files, at the cost of some disk
space. It makes interactive queries much more natural: drill-down style
queries are much faster, as they can leverage work done by previous
queries.
e.g. `SELECT * FROM stats WHERE city = 'Dawson Creek' and question_id >= 1935 and question_id <= 1940`
takes ~584ms on the first run, but 9ms on subsequent runs.
We only create entries when the estimates don't match the actual
results.
Fixes #6
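A sketch of what that shadow table might look like; the column names and the BLOB encoding of the bitmaps are assumptions, only the `_stats_rowgroups` name comes from the description above.
```cpp
#include <sqlite3.h>

// Hypothetical schema for the shadow table described above.
void ensureShadowTable(sqlite3* db) {
  sqlite3_exec(db,
               "CREATE TABLE IF NOT EXISTS _stats_rowgroups ("
               "  clause TEXT PRIMARY KEY,"   // e.g. city = 'Dawson Creek'
               "  estimated_rowgroups BLOB,"  // bitmap derived from stats
               "  actual_rowgroups BLOB"      // bitmap observed while scanning
               ")",
               nullptr, nullptr, nullptr);
}
```
On a repeat query, a clause that already has a row here lets the scan consult `actual_rowgroups` instead of re-probing every row group, which is where the ~584ms-to-9ms drop above comes from.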
Very simplistic: select M fields, filter on N fields, with a slight bias
toward using values of the same type as the field being compared against.
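A rough sketch of such a generator; every name here is illustrative, and the real tool presumably tracks per-column types rather than carrying a single sample value.
```cpp
#include <random>
#include <sstream>
#include <string>
#include <vector>

// Illustrative column description for the generator.
struct Column {
  std::string name;
  std::string sampleValue;  // a literal of this column's type
};

// Build "select a, b from t where c >= x and d >= y" style queries:
// M selected fields, N filters, usually comparing a column against a
// value of its own type.
std::string generateQuery(const std::vector<Column>& cols, std::mt19937& rng) {
  std::uniform_int_distribution<size_t> pick(0, cols.size() - 1);
  std::uniform_int_distribution<int> howMany(1, 3);
  std::bernoulli_distribution sameType(0.75);  // the "slight bias"

  std::ostringstream sql;
  sql << "select ";
  for (int i = 0, m = howMany(rng); i < m; i++)
    sql << (i ? ", " : "") << cols[pick(rng)].name;

  sql << " from nulls1 where ";
  for (int i = 0, n = howMany(rng); i < n; i++) {
    const Column& lhs = cols[pick(rng)];
    // Usually draw the literal from the same column's type; sometimes
    // from another column's type, to exercise mismatched comparisons.
    const Column& rhs = sameType(rng) ? lhs : cols[pick(rng)];
    sql << (i ? " and " : "") << lhs.name << " >= " << rhs.sampleValue;
  }
  return sql.str();
}
```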
No segfaults yet, but one test case that generates differing output when
run against `nulls` and `nulls1`:
```
select rowid from nulls1 where binary_9 >= '56' and ts_5 < 496886400000;
```
Regularize the parquets: `nulls` and `no_nulls` each come in 3 variants,
with 1, 10 and 99 rows per rowgroup.
All test queries are written against `nullsA` and `no_nullsA`.
The next commit will introduce a tool to expand these template queries to
run against the actual tables.
This drops the `= 'Dawson Creek'` query from 210ms to 145ms.
Maybe inlining would have been an option here? I'm not familiar enough
with g++ to know. :(
Now the `== 'Dawson Creek'` query is ~210ms, which is approx the
same as a `count(*)` query. This seems maybe OK, since the row group
filter is only excluding 30% of records.
This gets the census `== 'Dawson Creek'` query down to ~410ms from
~650ms.
That still seems much slower than it should be. Am I accidentally
doing a copy? Now to go learn how to profile C++ code...