mirror of https://github.com/cldellow/sqlite-parquet-vtable.git synced 2025-02-26 06:39:45 +00:00

23 Commits

Author SHA1 Message Date
Colin Dellow
01e8ffaba7 Row group filtering for double/float 2018-03-16 16:30:05 -04:00
Colin Dellow
9c22fd1f57 Row group filters for strings, int32/64/96, bools 2018-03-16 16:07:41 -04:00
Colin Dellow
1f4cebe2a6 Don't use accessors
This drops the `= 'Dawson Creek'` query from 210ms to 145ms.

Maybe inlining would have been an option here? I'm not familiar enough
with g++ to know. :(
2018-03-15 23:04:11 -04:00
Colin Dellow
8ba13f44d5 Remove unnecessary copy
Now the `== 'Dawson Creek'` query is ~210ms, which is approx the
same as a `count(*)` query. This seems maybe OK, since the row group
filter is only excluding 30% of records.
2018-03-15 22:10:45 -04:00
Colin Dellow
f7f1ed03d1 add row filter for string ==
This gets the census `== 'Dawson Creek'` query down to ~410ms from
~650ms.

That still seems much slower than it should be. Am I accidentally
doing a copy? Now to go learn how to profile C++ code...
2018-03-15 21:37:52 -04:00
Colin Dellow
6648ff5968 add string == row group filter
For the statscan census set filtering on `== 'Dawson Creek'`, the query
goes from 980ms to 660ms.

This is expected, since the data isn't sorted by that column.

I'll try adding some scaffolding to do filtering at the row level, too.

We could also try unpacking the dictionary and testing the individual
values, although we may want some heuristics to decide whether it's
worth doing -- e.g. if < 10% of the rows have a unique value.

Ideally, this should be like a ~1ms query.
2018-03-15 20:40:21 -04:00
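
The row group filter described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the extension's actual code: `StringStats` and `mayContain` are hypothetical names, and the sketch assumes each row group exposes optional min/max statistics for the string column, compared bytewise. It also shows why the gain is limited on unsorted data: when values are spread across the file, most row groups have a min/max range that still brackets the target.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical stand-in for the min/max statistics of one string column
// in one row group. Either bound may be absent if the writer omitted it.
struct StringStats {
  std::optional<std::string> min;
  std::optional<std::string> max;
};

// Returns true when the row group might contain `needle` and so must be
// scanned; returns false only when the statistics prove it cannot match.
bool mayContain(const StringStats& stats, const std::string& needle) {
  // Without statistics nothing can be ruled out, so keep the row group.
  if (!stats.min || !stats.max) return true;
  // Sanity check (assumption): well-formed statistics satisfy min <= max.
  assert(!(*stats.max < *stats.min));
  // An equality match is only possible when min <= needle <= max.
  return *stats.min <= needle && needle <= *stats.max;
}
```

A row group whose range is `Calgary`..`Edmonton` is kept for `'Dawson Creek'` but skipped for `'Vancouver'`.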
Colin Dellow
dc431aee20 Dispatch row group filtering based on parquet type 2018-03-15 20:25:02 -04:00
Colin Dellow
92ba5f94e0 reuse FileMetaData
For the statscan dataset, parsing the file metadata takes ~30-40ms,
so stash it away for future re-use.
2018-03-15 19:57:38 -04:00
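
One way to stash parsed metadata for re-use is a small lazy cache, sketched below. The names `MetadataCache` and `loader` are hypothetical and not taken from the repository; the sketch assumes the expensive parse can be wrapped in a callable that returns a shared pointer, and it is not thread-safe.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical lazy cache: runs the expensive loader at most once and
// hands the same shared object to every later caller.
template <typename Meta>
class MetadataCache {
public:
  explicit MetadataCache(std::function<std::shared_ptr<Meta>()> loader)
      : loader_(std::move(loader)) {
    assert(loader_ != nullptr);  // a cache without a loader is a bug
  }

  // Returns the cached metadata, parsing it on first use only.
  std::shared_ptr<Meta> get() {
    if (!cached_) cached_ = loader_();
    return cached_;
  }

private:
  std::function<std::shared_ptr<Meta>()> loader_;
  std::shared_ptr<Meta> cached_;
};
```

With a cache like this, the ~30-40ms parse cost would be paid on the first `get()` only; later calls return the stored pointer.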
Colin Dellow
769060dbcb Add stub row group filters for text/int/dbl
Checkpointing to investigate why min/max stats for text aren't
present
2018-03-12 23:07:41 -04:00
Colin Dellow
110e3e3668 row group skipping for is [not] null queries 2018-03-12 21:09:00 -04:00
Colin Dellow
acc15256ec Add rowgroup filtering for rowid 2018-03-12 20:42:50 -04:00
Colin Dellow
1f938a005d More test cases to deal with affinity
I'm not sure how these manifest - whether SQLite retypes them based on
column affinity before we see them, or whether they're provided as is.
2018-03-11 19:18:44 -04:00
Colin Dellow
095b576cc2 Scaffolding for row group filters, tests
rowid is special since its column index is -1, so add
explicit tests around it
2018-03-11 15:44:51 -04:00
Colin Dellow
5559a7b563 Fix when last rowgroup is not same size as first
...change test data to use 99 rows, so that when we have
rowgroup size 10 we exercise this code.
2018-03-11 15:15:27 -04:00
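
The edge case in the commit above can be illustrated with a little arithmetic: 99 rows split into row groups of 10 gives nine full groups and a final group of 9. The helper below is a hypothetical sketch (`rowsInGroup` is not a function from the repository) that assumes every group except the last is full; real Parquet files record each row group's row count in the file metadata, which is the safer source.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of rows in row group `groupIndex`, assuming
// all groups hold `groupSize` rows except possibly a shorter last one.
int64_t rowsInGroup(int64_t totalRows, int64_t groupSize, int64_t groupIndex) {
  assert(groupSize > 0 && totalRows >= 0 && groupIndex >= 0);
  const int64_t start = groupIndex * groupSize;  // first row of this group
  if (start >= totalRows) return 0;              // past the end of the file
  return std::min(groupSize, totalRows - start); // last group may be short
}
```

For the test data described above, `rowsInGroup(99, 10, 0)` is 10 and `rowsInGroup(99, 10, 9)` is 9.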
Colin Dellow
830053c1fc Scaffolding for in-extension filtering
Supports IS NULL and IS NOT NULL checks
2018-03-11 13:58:10 -04:00
Colin Dellow
210f322a1c Code to pretty print constraints 2018-03-10 10:59:53 -05:00
Colin Dellow
67005623df ensureColumn catches up when rows are skipped 2018-03-04 22:29:35 -05:00
Colin Dellow
bb3a9440f7 Add query test framework, fix xFilter 2018-03-04 21:05:26 -05:00
Colin Dellow
4c54ab89ae Don't segfault on full table scan 2018-03-04 17:49:19 -05:00
Colin Dellow
7edb5e472f Support BLOBs 2018-03-04 17:20:59 -05:00
Colin Dellow
67b0d96967 float support 2018-03-03 20:57:09 -05:00
Colin Dellow
eb0b48f867 Boolean, INT96, INT64 2018-03-03 20:00:50 -05:00
Colin Dellow
1de843fca8 Very rough first cut
supports int32, double, strings.
2018-03-03 15:44:01 -05:00