138 Commits

Author SHA1 Message Date
Colin Dellow 6648ff5968 add string == row group filter
For the statscan census set, filtering on `== 'Dawson Creek'` takes the
query from 980ms to 660ms.

The modest gain is expected, since the data isn't sorted by that column.

I'll try adding some scaffolding to do filtering at the row level, too.

We could also try unpacking the dictionary and testing the individual
values, although we may want some heuristics to decide whether it's
worth doing -- e.g. if < 10% of the rows have a unique value.

Ideally, this should be a ~1ms query.
2018-03-15 20:40:21 -04:00
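
A minimal sketch of how a string `==` row group filter like the one in 6648ff5968 could work, assuming each row group exposes optional min/max statistics for the column. `StringStats` and `rowGroupMayContain` are hypothetical names for illustration, not the extension's code or the parquet-cpp API, and the comparison assumes plain byte-wise string ordering.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the min/max statistics a row group may carry
// for a string column. Not the parquet-cpp API.
struct StringStats {
  bool hasMinMax;   // false when the writer did not record statistics
  std::string min;
  std::string max;
};

// Returns true if the row group might contain a row equal to `needle`.
// A false result means the whole row group can be skipped.
bool rowGroupMayContain(const StringStats& stats, const std::string& needle) {
  if (!stats.hasMinMax) return true;   // no stats: cannot rule the group out
  assert(!(stats.max < stats.min));    // sanity check on the recorded range
  return !(needle < stats.min) && !(stats.max < needle);
}
```

On unsorted data most row groups span a wide min/max range, so few of them can be skipped, which is consistent with the modest speedup reported in the commit message.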
Colin Dellow dc431aee20 Dispatch row group filtering based on parquet type 2018-03-15 20:25:02 -04:00
Colin Dellow 92ba5f94e0 reuse FileMetaData
For the statscan dataset, parsing the file metadata takes ~30-40ms,
so stash it away for future re-use.
2018-03-15 19:57:38 -04:00
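
A rough sketch of the "parse once, reuse afterwards" idea from 92ba5f94e0. `FileMeta` and `MetaCache` are hypothetical names, and the parse step is passed in as a callable, so nothing here reflects the actual parquet-cpp types or how the extension stores its metadata.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for parsed Parquet file metadata.
struct FileMeta {
  long numRows;
};

// Holds the result of an expensive parse so that it runs at most once.
class MetaCache {
 public:
  // `parse` is any callable returning std::shared_ptr<FileMeta>.
  template <typename ParseFn>
  std::shared_ptr<FileMeta> get(ParseFn parse) {
    if (!cached_) {
      cached_ = parse();   // only the first call pays the parse cost
      assert(cached_);     // the parse step must not return null
    }
    return cached_;
  }

 private:
  std::shared_ptr<FileMeta> cached_;
};
```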
Colin Dellow 769060dbcb Add stub row group filters for text/int/dbl
Checkpointing to investigate why min/max stats for text aren't
present
2018-03-12 23:07:41 -04:00
Colin Dellow 110e3e3668 row group skipping for is [not] null queries 2018-03-12 21:09:00 -04:00
Colin Dellow 95748a5192 Remove bool from Constraint 2018-03-12 20:50:30 -04:00
Colin Dellow acc15256ec Add rowgroup filtering for rowid 2018-03-12 20:42:50 -04:00
Colin Dellow 1f938a005d More test cases to deal with affinity
I'm not sure how these manifest - whether SQLite retypes them based on
column affinity before we see them, or whether they're provided as is.
2018-03-11 19:18:44 -04:00
Colin Dellow 095b576cc2 Scaffolding for row group filters, tests
rowid is special since its column index is -1, so add
explicit tests around it
2018-03-11 15:44:51 -04:00
Colin Dellow 5559a7b563 Fix when last rowgroup is not same size as first
...change test data to use 99 rows, so that when we have
rowgroup size 10 we exercise this code.
2018-03-11 15:15:27 -04:00
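
The commit message for 5559a7b563 does not show the fix itself. As an illustration only, one way to avoid assuming that every row group has the same size is to walk the per-group row counts when locating a row. `findRowGroup` and its parameters are hypothetical names, not taken from the extension.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the index of the row group holding the 0-based row `row`,
// or -1 if the row is out of range. Makes no assumption that all
// row groups hold the same number of rows.
int findRowGroup(const std::vector<std::int64_t>& rowsPerGroup, std::int64_t row) {
  if (row < 0) return -1;
  std::int64_t firstRow = 0;  // index of the first row in the current group
  for (std::size_t i = 0; i < rowsPerGroup.size(); i++) {
    assert(rowsPerGroup[i] >= 0);
    if (row < firstRow + rowsPerGroup[i]) return static_cast<int>(i);
    firstRow += rowsPerGroup[i];
  }
  return -1;
}
```

With 99 rows and a row group size of 10, as in the test data described above, the counts are nine groups of 10 followed by one group of 9, so the last valid row (98) falls in group 9 and row 99 is out of range.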
Colin Dellow 830053c1fc Scaffolding for in-extension filtering
Supports IS NULL and IS NOT NULL checks
2018-03-11 13:58:10 -04:00
Colin Dellow d28ae86d15 Test unusable constraints 2018-03-10 13:38:34 -05:00
Colin Dellow 96fcafcd2f Add test cases 2018-03-10 13:25:13 -05:00
Colin Dellow b7c134efc0 test-queries: can debug a testcase
`tests/test-queries regex` filters the test cases.

If the resulting set has only one test case, run it under gdb.
2018-03-10 11:54:36 -05:00
Colin Dellow 210f322a1c Code to pretty print constraints 2018-03-10 10:59:53 -05:00
Colin Dellow 2bc054a2cf Add crappy Makefile 2018-03-10 10:46:10 -05:00
Colin Dellow 824a416f51 better debug logs for xBestIndex 2018-03-08 13:21:33 -05:00
Colin Dellow 2d616c54fb More tests 2018-03-07 20:30:25 -05:00
Colin Dellow 35fcde926c Rewrite SQL oracle harness 2018-03-07 20:20:34 -05:00
Colin Dellow caefc23b1e Add a pg oracle
- define `datetime`, `printf` fns in pg so it produces output
  similar to sqlite's

- tidy up input data to be less wide

To do: some fns to make it easy to generate a new test case. Probably
want to mount all 3 parquets simultaneously and refer to the
sqlite table by the same name as the pg table.
2018-03-07 19:40:38 -05:00
Colin Dellow 0d4806ca6f Rejig parquet generation
- "fixed_size_binary" -> "binary_10"
- make null parquet use rowgroups of size 10: first rowgroup
  has no nulls, 2nd has all nulls, 3rd-10th have alternating
  nulls

This is prep for making a Postgres layer to use as an oracle
for generating test cases so that we have good coverage before
implementing advanced `xBestIndex` and `xFilter` modes.
2018-03-06 21:02:26 -05:00
Colin Dellow 56245c1d3d test case for nulls 2018-03-04 22:48:39 -05:00
Colin Dellow 67005623df `ensureColumn` catches up when rows are skipped 2018-03-04 22:29:35 -05:00
Colin Dellow bb3a9440f7 Add query test framework, fix xFilter 2018-03-04 21:05:26 -05:00
Colin Dellow 4c54ab89ae Don't segfault on full table scan 2018-03-04 17:49:19 -05:00
Colin Dellow 7edb5e472f Support BLOBs 2018-03-04 17:20:59 -05:00
Colin Dellow f3e78408bf Update demo to use checked in parquet 2018-03-04 13:06:50 -05:00
Colin Dellow aea9469bff tweak wording 2018-03-04 13:04:58 -05:00
Colin Dellow a4f368af9c Add tests for unsupported types 2018-03-04 13:02:42 -05:00
Colin Dellow 681c8a443f Add tool to generate sample parquets 2018-03-04 12:15:50 -05:00
Colin Dellow 67b0d96967 float support 2018-03-03 20:57:09 -05:00
Colin Dellow 18f07f4c43 More defensive, add caveats 2018-03-03 20:30:46 -05:00
Colin Dellow eb0b48f867 Boolean, INT96, INT64 2018-03-03 20:00:50 -05:00
Colin Dellow 1de843fca8 Very rough first cut
supports int32, double, strings.
2018-03-03 15:44:01 -05:00
Colin Dellow f8599f8d3e Rename some references to CSVs
...some nonsensical things, like "first row of Parquet",
but we'll tidy them up later.
2018-03-02 19:18:36 -05:00
Colin Dellow 552da5a647 Initial checkin of CSV table
parquet.cc is a fork of the sample CSV virtual table at
https://www.sqlite.org/src/artifact?ci=trunk&filename=ext/misc/csv.c

So far the only changes are those needed to make it compile cleanly in
C++11 mode.
2018-03-02 18:59:34 -05:00
Colin Dellow 811badc9f9 Add script to fetch+build sqlite 2018-03-02 18:46:40 -05:00
Colin Dellow 8b9b3bcc9d Initial commit 2018-03-02 18:37:08 -05:00