mirror of https://github.com/cldellow/sqlite-parquet-vtable.git synced 2025-03-06 07:19:45 +00:00

118 Commits

Author SHA1 Message Date
Colin Dellow
cafd087113 Update README 2018-03-24 12:49:03 -04:00
Colin Dellow
51d0f27a68 don't segfault on low memory
Fixes #8
2018-03-24 12:48:29 -04:00
Colin Dellow
6fa7bc3d0b Add harness for low memory testing 2018-03-24 11:27:06 -04:00
Colin Dellow
599430b2f4 Add #ifdefs around printfs 2018-03-20 19:57:12 -04:00
Colin Dellow
5480de7fb6 Compile w/static linkages for parquet
Fixes #4. A stock Ubuntu 14.04 can now install sqlite3:amd64 and
libboost-all-dev, then use this module to read the test parquet file.
2018-03-20 19:06:39 -04:00
Colin Dellow
8bf890ab66 Fix incorrect row pruning for non-text BYTE_ARRAY 2018-03-18 19:43:09 -04:00
Colin Dellow
893e4c63f5 Add testcase generator
Very simplistic: selects M fields, filters on N fields, with a slight
bias toward values of the same type as the field being compared against.

No segfaults yet, but one test case generates differing output when
run against `nulls` and `nulls1`:

```
select rowid from nulls1 where binary_9 >= '56' and ts_5 < 496886400000;
```
2018-03-18 19:11:26 -04:00
Colin Dellow
045e17da34 Note about 64-bit sqlite 2018-03-18 18:25:08 -04:00
Colin Dellow
b0c7b229dd Create queries from templates if needed 2018-03-18 17:50:39 -04:00
Colin Dellow
7f2042742b Also compare queries against SQLite itself 2018-03-18 17:49:12 -04:00
Colin Dellow
e2af2a07a4 Make rowid start from 1, not 0
Unclear whether this is strictly required, but I'm going to start using
SQLite as an oracle, and it'll be simpler if our rowids match theirs.
2018-03-18 17:03:46 -04:00
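The rowid convention mentioned above can be checked against stock SQLite. The following is a minimal sketch using Python's built-in `sqlite3` module; the table `t` and its `name` column are hypothetical and only serve to show that automatically assigned rowids in an ordinary, freshly created table begin at 1.

```python
import sqlite3

# In-memory database; the table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (name TEXT)")
conn.executemany("INSERT INTO t (name) VALUES (?)", [("a",), ("b",), ("c",)])

# SQLite assigns rowids automatically, starting from 1 for an empty table.
rowids = [row[0] for row in conn.execute("SELECT rowid FROM t ORDER BY rowid")]
print(rowids)  # [1, 2, 3]
```

A virtual table that numbers its rows the same way can be compared row for row against an ordinary SQLite table loaded with the same data.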
Colin Dellow
d430a45e41 Update README 2018-03-18 15:08:02 -04:00
Colin Dellow
1f3ffce560 Row group filtering for BYTE_ARRAY 2018-03-18 15:03:08 -04:00
Colin Dellow
7b302a0eb2 Bail on rowId constraint when non-int 2018-03-18 14:31:23 -04:00
Colin Dellow
078754467e Generate queries from templates
Huzzah, a bunch of failures have appeared.
2018-03-18 14:28:31 -04:00
Colin Dellow
e3f0dff083 Move queries/* to templates 2018-03-18 13:28:56 -04:00
Colin Dellow
65ea1b2f61 Rewrite tests for automatic generation
Regularize the parquets - nulls and nonulls each come in 3 variants,
with 1, 10 and 99 rows per rowgroup.

All test queries are written against nullsA, no_nullsA.

Next commit will introduce a tool to expand these template queries to
go against the actual tables.
2018-03-18 13:11:29 -04:00
Colin Dellow
3b557f7fb0 Add explicit test for file not found
...caching the metadata moved where ParquetTable did I/O,
which introduced a segfault on not found
2018-03-18 11:58:23 -04:00
Colin Dellow
4cbde9fc09 Row filtering for doubles 2018-03-17 16:09:57 -04:00
Colin Dellow
86e09b111e Add row filtering for int32/64/96/boolean 2018-03-17 16:05:38 -04:00
Colin Dellow
a3af16eb54 Row-filtering for other string ops 2018-03-17 15:28:51 -04:00
Colin Dellow
03a20a9432 LIKE row group filtering
~1.7s -> ~1.0s for the census data set on `LIKE 'Dawson %'`
2018-03-17 00:11:38 -04:00
Colin Dellow
753a490687 Tests for blobs 2018-03-16 23:53:08 -04:00
Colin Dellow
01e8ffaba7 Row group filtering for double/float 2018-03-16 16:30:05 -04:00
Colin Dellow
9c22fd1f57 Row group filters for strings, int32/64/96, bools 2018-03-16 16:07:41 -04:00
Colin Dellow
cbf388698b BOOL and INT96 tests 2018-03-16 16:02:11 -04:00
Colin Dellow
e87f0d0f68 Note about versions 2018-03-16 00:19:25 -04:00
Colin Dellow
1f4cebe2a6 Don't use accessors
This drops the `= 'Dawson Creek'` query from 210ms to 145ms.

Maybe inlining would have been an option here? I'm not familiar enough
with g++ to know. :(
2018-03-15 23:04:11 -04:00
Colin Dellow
8ba13f44d5 Remove unnecessary copy
Now the `== 'Dawson Creek'` query is ~210ms, which is approx the
same as a `count(*)` query. This seems maybe OK, since the row group
filter is only excluding 30% of records.
2018-03-15 22:10:45 -04:00
Colin Dellow
f7f1ed03d1 add row filter for string ==
This gets the census `== 'Dawson Creek'` query down to ~410ms from
~650ms.

That still seems much slower than it should be. Am I accidentally
doing a copy? Now to go learn how to profile C++ code...
2018-03-15 21:37:52 -04:00
Colin Dellow
6648ff5968 add string == row group filter
For the statscan census set filtering on `== 'Dawson Creek'`, the query
goes from 980ms to 660ms.

This is expected, since the data isn't sorted by that column.

I'll try adding some scaffolding to do filtering at the row level, too.

We could also try unpacking the dictionary and testing the individual
values, although we may want some heuristics to decide whether it's
worth doing -- eg if < 10% of the rows have a unique value.

Ideally, this should be like a ~1ms query.
2018-03-15 20:40:21 -04:00
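The idea behind a string `==` row group filter can be sketched as follows. This is not the extension's C++ code: it is a minimal illustration in Python, assuming each row group exposes optional min/max statistics for the column. `RowGroupStats`, `may_contain_equal`, `groups_to_scan` and the sample statistics are all hypothetical; only the `'Dawson Creek'` value comes from the message above. Python compares strings by code point, which may differ from the byte-wise ordering a Parquet writer uses for non-ASCII data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RowGroupStats:
    """Hypothetical per-row-group min/max statistics for one string column."""
    min_value: Optional[str]
    max_value: Optional[str]


def may_contain_equal(stats: RowGroupStats, needle: str) -> bool:
    """Return False only when the statistics prove the value is absent."""
    if stats.min_value is None or stats.max_value is None:
        # Missing statistics: the group cannot be ruled out, so scan it.
        return True
    return stats.min_value <= needle <= stats.max_value


def groups_to_scan(groups: List[RowGroupStats], needle: str) -> List[int]:
    """Indexes of the row groups that still have to be read."""
    return [i for i, g in enumerate(groups) if may_contain_equal(g, needle)]


# Made-up statistics for four row groups.
groups = [
    RowGroupStats("Abbotsford", "Calgary"),   # range ends before the needle
    RowGroupStats("Burnaby", "Edmonton"),     # range covers the needle
    RowGroupStats("Fort Nelson", "Kelowna"),  # range starts after the needle
    RowGroupStats(None, None),                # no statistics available
]
scan = groups_to_scan(groups, "Dawson Creek")
print(scan)  # [1, 3]
```

When the data is not sorted by the filtered column, most row groups have wide min/max ranges that cover the value, so only a fraction of them can be skipped. That is consistent with the modest speedup reported above.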
Colin Dellow
dc431aee20 Dispatch row group filtering based on parquet type 2018-03-15 20:25:02 -04:00
Colin Dellow
92ba5f94e0 reuse FileMetaData
For the statscan dataset, parsing the file metadata takes ~30-40ms,
so stash it away for future re-use.
2018-03-15 19:57:38 -04:00
Colin Dellow
769060dbcb Add stub row group filters for text/int/dbl
Checkpointing to investigate why min/max stats for text aren't
present
2018-03-12 23:07:41 -04:00
Colin Dellow
110e3e3668 row group skipping for is [not] null queries 2018-03-12 21:09:00 -04:00
Colin Dellow
95748a5192 Remove bool from Constraint 2018-03-12 20:50:30 -04:00
Colin Dellow
acc15256ec Add rowgroup filtering for rowid 2018-03-12 20:42:50 -04:00
Colin Dellow
1f938a005d More test cases to deal with affinity
I'm not sure how these manifest - whether SQLite retypes them based on
column affinity before we see them, or whether they're provided as is.
2018-03-11 19:18:44 -04:00
Colin Dellow
095b576cc2 Scaffolding for row group filters, tests
rowid is special since its column index is -1, so add
explicit tests around it
2018-03-11 15:44:51 -04:00
Colin Dellow
5559a7b563 Fix when last rowgroup is not same size as first
...change test data to use 99 rows, so that when we have
rowgroup size 10 we exercise this code.
2018-03-11 15:15:27 -04:00
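The arithmetic behind the 99-row test data can be sketched as below. `row_group_sizes` is a hypothetical helper, not part of the repository, and it assumes the writer fills each row group completely before starting the next.

```python
from typing import List


def row_group_sizes(total_rows: int, rows_per_group: int) -> List[int]:
    """Sizes of the row groups when every group except possibly the last is full."""
    full, remainder = divmod(total_rows, rows_per_group)
    return [rows_per_group] * full + ([remainder] if remainder else [])


sizes = row_group_sizes(99, 10)
print(len(sizes), sizes[0], sizes[-1])  # 10 10 9
```

With 99 rows and a row group size of 10, the last group holds 9 rows, so code that assumes every group is as large as the first is exercised. With 100 rows, every group would have been full and the case would not have been tested.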
Colin Dellow
830053c1fc Scaffolding for in-extension filtering
Supports IS NULL and IS NOT NULL checks
2018-03-11 13:58:10 -04:00
Colin Dellow
d28ae86d15 Test unusable constraints 2018-03-10 13:38:34 -05:00
Colin Dellow
96fcafcd2f Add test cases 2018-03-10 13:25:13 -05:00
Colin Dellow
b7c134efc0 test-queries: can debug a testcase
`tests/test-queries regex` filters the test cases.

If the resulting set has only one test case, run it under gdb.
2018-03-10 11:54:36 -05:00
Colin Dellow
210f322a1c Code to pretty print constraints 2018-03-10 10:59:53 -05:00
Colin Dellow
2bc054a2cf Add crappy Makefile 2018-03-10 10:46:10 -05:00
Colin Dellow
824a416f51 better debug logs for xBestIndex 2018-03-08 13:21:33 -05:00
Colin Dellow
2d616c54fb More tests 2018-03-07 20:30:25 -05:00
Colin Dellow
35fcde926c Rewrite SQL oracle harness 2018-03-07 20:20:34 -05:00
Colin Dellow
caefc23b1e Add a pg oracle
- define `datetime`, `printf` fns in pg so it produces output similar
  to sqlite's

- tidy up input data to be less wide

To do: some fns to make it easy to generate a new test case. Probably
want to mount all 3 parquets simultaneously and refer to the
sqlite table by the same name as the pg table.
2018-03-07 19:40:38 -05:00