mirror of https://github.com/cldellow/sqlite-parquet-vtable.git synced 2025-09-10 22:28:53 +00:00
Commit Graph

95 Commits

Author SHA1 Message Date
Colin Dellow
373616ad1e Don't try to optimize IsNot
Doesn't handle NULLs correctly, will open separate ticket
for it. Fixes the IS NOT case of #26
2018-07-04 18:59:16 -04:00
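The NULL behaviour this commit alludes to can be seen in stock SQLite, independent of this extension. A minimal sketch using Python's `sqlite3` module follows; the table `t` and column `x` are made up for illustration, and this may not be the exact case behind #26.

```python
import sqlite3

# Hypothetical in-memory table with one NULL row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (None,)])


def count(where):
    """Count rows of t matching the given WHERE expression."""
    return conn.execute(f"SELECT count(*) FROM t WHERE {where}").fetchone()[0]


# IS NOT treats NULL as an ordinary, comparable value, so the NULL row matches.
is_not_count = count("x IS NOT 1")

# <> yields NULL when x is NULL, so the NULL row is filtered out.
not_equal_count = count("x <> 1")

print(is_not_count, not_equal_count)
```

A row group filter that treats `IS NOT` like `<>` would therefore risk dropping row groups that contain only NULLs.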
Colin Dellow
0aa98ae1a5 Skip shared parquet/arrow libs, fix icu versions 2018-06-27 23:20:33 -04:00
Colin Dellow
2167d102b4 Add make-linux-pgo
fixes #23, with perhaps some open questions about why PGO on
arrow/parquet-cpp regressed things.
2018-06-27 22:23:22 -04:00
Colin Dellow
1a4f540e18 Stub PGO code in
Incremental progress on #23 - should probably add a dedicated flag that
creates the instrumented binary, runs a test suite, then creates the
optimized binary.
2018-06-26 23:50:11 -04:00
Colin Dellow
1d0d4c08b8 Build sqlite in parallel 2018-06-26 23:05:30 -04:00
Colin Dellow
76fb058dd7 Link Boost statically
Fixes #15
2018-06-26 22:44:50 -04:00
Colin Dellow
263a6af7ec Use Arrow's compression libraries
Fixes #27
2018-06-26 08:17:18 -04:00
Colin Dellow
129ff4e694 Merge pull request #25 from evansd/makefile-fixes
Makefile fixes
2018-06-25 13:54:29 -04:00
David Evans
b7da04433b Include header locations we need 2018-06-25 18:20:24 +01:00
David Evans
ab87b13b75 Reverse prereqs order to get build to work 2018-06-25 18:20:04 +01:00
Colin Dellow
bc0be71546 Add brotli/snappy/gzip test files
`test/test-supported` verifies they can be opened
2018-06-25 08:32:36 -04:00
Colin Dellow
0bdcc9895e All-in-one build command
`./make-linux` clones and builds:

- arrow
- brotli
- lz4
- parquet
- snappy
- zlib
- zstd
- this project

as a statically linked binary. Two Boost libs are still pulled in as
shared libs; that should probably be fixed, too, for ultimate portability.
2018-06-24 21:11:07 -04:00
Colin Dellow
ec6e970bbc Fix order by rowid to apply w/o clause
Fixes #12, first screen of datasette is fast now
2018-06-24 15:20:06 -04:00
Colin Dellow
5b59ba02fe Make ORDER BY ROWID fast
Fixes #11
2018-06-24 15:07:27 -04:00
Colin Dellow
b774973852 Avoid row filter check when no constraints
The function call overhead is expensive!

This makes count(*) on the census data 175ms instead
of 225ms, while not significantly impacting other use cases.
2018-06-24 14:51:54 -04:00
Colin Dellow
84a22e6e77 link to blog 2018-06-24 11:39:44 -04:00
Colin Dellow
16cdd70f2b Short-circuit row group evaluation
We can avoid eagerly computing bitmasks for other constraints this way.

Possible future work - order the constraints such that we evaluate the
one that is cheapest/most likely to prune a row group first.

This reduces the cyclist query from ~65ms to ~60ms
2018-06-24 11:08:56 -04:00
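The short-circuiting described in this commit can be sketched roughly as below. This is an illustration only, not the extension's C++ code: the per-row-group statistics layout (`column -> (min, max)`) and the constraint functions are assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical shape: column name -> (min, max) statistics for one row group.
RowGroupStats = Dict[str, Tuple]
Constraint = Callable[[RowGroupStats], bool]


def row_group_may_match(stats: RowGroupStats, constraints: List[Constraint]) -> bool:
    """Return False as soon as any constraint rules the row group out.

    all() over a generator stops at the first False, so later
    constraints are never evaluated for a pruned row group.
    """
    return all(constraint(stats) for constraint in constraints)


calls = []  # records which constraints actually ran


def profile_id_is_1930(stats: RowGroupStats) -> bool:
    calls.append("profile_id")
    low, high = stats["profile_id"]
    return low <= 1930 <= high


def geo_name_is_victoria(stats: RowGroupStats) -> bool:
    calls.append("geo_name")
    low, high = stats["geo_name"]
    return low <= "Victoria" <= high


stats = {"profile_id": (2000, 2100), "geo_name": ("Abbotsford", "Zeballos")}
result = row_group_may_match(stats, [profile_id_is_1930, geo_name_is_victoria])
print(result, calls)
```

Because the first constraint already excludes the row group, the second is skipped; ordering constraints by cost or selectivity, as the commit suggests, would decide which one gets to run first.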
Colin Dellow
fd87c44ccd Add link to csv2parquet 2018-06-23 23:58:13 -04:00
Colin Dellow
e1a86954e5 Revert "Don't eagerly evaluate constraints"
This reverts commit cbde3c73b6.

This regresses:

```
WITH inputs AS (
  SELECT
    geo_name,
    CASE WHEN profile_id = 1930 THEN 'total' ELSE 'cyclist' END AS mode,
    female,
    male
  FROM census
  WHERE profile_id IN ( '1930', '1935') AND
    csd_type_name = 'CY' AND
    geo_name IN ('Victoria', 'Dawson Creek', 'Kitchener')
)
SELECT
  total.geo_name,
  cyclist.male,
  cyclist.female,
  100.0 * cyclist.male / total.male,
  100.0 * cyclist.female / total.female
FROM inputs AS total
JOIN inputs AS cyclist USING (geo_name)
WHERE total.mode = 'total' AND cyclist.mode = 'cyclist';
```

while improving:

```
select count(*) from census where geo_name in ('Dawson Creek', 'Kitchener', 'Victoria') and csd_type_name = 'CY' and profile_id = '1930';
```

which seems like a bad tradeoff.
2018-06-23 20:48:39 -04:00
Colin Dellow
603153c36c avoid looking up physical type 2018-06-23 20:42:38 -04:00
Colin Dellow
cbde3c73b6 Don't eagerly evaluate constraints
...to avoid decompressing columns when we know from previous
columns that the row can't match.

Fixes #10
2018-06-23 20:31:03 -04:00
Colin Dellow
d7c5002cee Move some code out of ensureColumn
Saves ~4% on the cold census needle query (~425ms -> ~405ms)
2018-06-23 19:10:23 -04:00
Colin Dellow
b9c58bd97e persist row group clauses on EOF
...not on close. Fixes #9
2018-06-23 16:25:56 -04:00
Colin Dellow
6d4be61261 tweak Makefile 2018-06-23 16:13:18 -04:00
Colin Dellow
596496c9cb rejig README 2018-03-25 00:07:56 -04:00
Colin Dellow
d3ab5ff3e7 Cache clauses -> row group mapping
Create a shadow table. For `stats`, it'd be `_stats_rowgroups`.

It contains three columns:

- the clause (eg `city = 'Dawson Creek'`)
- the initial estimate, as a bitmap of rowgroups based on stats
- the actual observed rowgroups, as a bitmap

This papers over poorly sorted parquet files, at the cost of some disk
space. It makes interactive queries much more natural -- drilldown style
queries are much faster, as they can leverage work done by previous
queries.

eg `SELECT * FROM stats WHERE city = 'Dawson Creek' and question_id >= 1935 and question_id <= 1940`
takes ~584ms on first run, but 9ms on subsequent runs.

We only create entries when the estimates don't match the actual
results.

Fixes #6
2018-03-24 23:57:15 -04:00
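The shadow table described in this commit might look roughly like the sketch below. Only the table name `_stats_rowgroups` and the three-column idea come from the commit message; the column names, the bitmap encoding, and the helper functions are assumptions made for illustration.

```python
import sqlite3


def to_bitmap(groups, n_groups):
    """Encode a set of row group indexes as a little-endian bitmap."""
    bits = bytearray((n_groups + 7) // 8)
    for g in groups:
        bits[g // 8] |= 1 << (g % 8)
    return bytes(bits)


def from_bitmap(blob):
    """Decode a bitmap back into a set of row group indexes."""
    return {i for i in range(len(blob) * 8) if (blob[i // 8] >> (i % 8)) & 1}


conn = sqlite3.connect(":memory:")
# Hypothetical column names for the three columns the commit describes.
conn.execute(
    "CREATE TABLE _stats_rowgroups ("
    "clause TEXT PRIMARY KEY, estimate BLOB, actual BLOB)"
)


def record(clause, estimate, actual, n_groups):
    """Store an entry only when the stats-based estimate was wrong."""
    if estimate == actual:
        return
    conn.execute(
        "INSERT OR REPLACE INTO _stats_rowgroups VALUES (?, ?, ?)",
        (clause, to_bitmap(estimate, n_groups), to_bitmap(actual, n_groups)),
    )


def lookup(clause):
    """Return the observed row groups for a clause, or None if not cached."""
    row = conn.execute(
        "SELECT actual FROM _stats_rowgroups WHERE clause = ?", (clause,)
    ).fetchone()
    return None if row is None else from_bitmap(row[0])


# Stats suggested all four row groups; only row group 2 really matched.
record("city = 'Dawson Creek'", {0, 1, 2, 3}, {2}, 4)
# Estimate was already exact, so nothing is stored.
record("question_id >= 1935", {1}, {1}, 4)
```

A later query with the same clause can call `lookup` first and scan only the row groups that previously produced rows.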
Colin Dellow
d2c736f25a Add LIMIT/OFFSET to random queries 2018-03-24 19:02:30 -04:00
Colin Dellow
cafd087113 Update README 2018-03-24 12:49:03 -04:00
Colin Dellow
51d0f27a68 don't segfault on low memory
Fixes #8
2018-03-24 12:48:29 -04:00
Colin Dellow
6fa7bc3d0b Add harness for low memory testing 2018-03-24 11:27:06 -04:00
Colin Dellow
599430b2f4 Add #ifdefs around printfs 2018-03-20 19:57:12 -04:00
Colin Dellow
5480de7fb6 Compile w/static linkages for parquet
Fixes #4. A stock Ubuntu 14.04 can now install sqlite3:amd64 and
libboost-all-dev, then use this module to read the test parquet file.
2018-03-20 19:06:39 -04:00
Colin Dellow
8bf890ab66 Fix incorrect row pruning for non-text BYTE_ARRAY 2018-03-18 19:43:09 -04:00
Colin Dellow
893e4c63f5 Add testcase generator
Very simplistic - selects M fields, filters on N fields, with a slight bias
to use values of the same type as the field it's comparing against.

No segfaults yet, but one test case that generates differing output when
run against `nulls` and `nulls1`:

```
select rowid from nulls1 where binary_9 >= '56' and ts_5 < 496886400000;
```
2018-03-18 19:11:26 -04:00
Colin Dellow
045e17da34 Note about 64-bit sqlite 2018-03-18 18:25:08 -04:00
Colin Dellow
b0c7b229dd Create queries from templates if needed 2018-03-18 17:50:39 -04:00
Colin Dellow
7f2042742b Also compare queries against SQLite itself 2018-03-18 17:49:12 -04:00
Colin Dellow
e2af2a07a4 Make rowid start from 1, not 0
Unclear whether this is strictly required, but I'm going to start using
SQLite as an oracle, and it'll be simpler if our rowids match theirs.
2018-03-18 17:03:46 -04:00
Colin Dellow
d430a45e41 Update README 2018-03-18 15:08:02 -04:00
Colin Dellow
1f3ffce560 Row group filtering for BYTE_ARRAY 2018-03-18 15:03:08 -04:00
Colin Dellow
7b302a0eb2 Bail on rowId constraint when non-int 2018-03-18 14:31:23 -04:00
Colin Dellow
078754467e Generate queries from templates
Huzzah, a bunch of failures have appeared.
2018-03-18 14:28:31 -04:00
Colin Dellow
e3f0dff083 Move queries/* to templates 2018-03-18 13:28:56 -04:00
Colin Dellow
65ea1b2f61 Rewrite tests for automatic generation
Regularize the parquets - nulls and nonulls each come in 3 variants,
with 1, 10 and 99 rows per rowgroup.

All test queries are written against nullsA, no_nullsA.

Next commit will introduce a tool to expand these template queries to
go against the actual tables.
2018-03-18 13:11:29 -04:00
Colin Dellow
3b557f7fb0 Add explicit test for file not found
...caching the metadata moved where ParquetTable did I/O,
which introduced a segfault on not found
2018-03-18 11:58:23 -04:00
Colin Dellow
4cbde9fc09 Row filtering for doubles 2018-03-17 16:09:57 -04:00
Colin Dellow
86e09b111e Add row filtering for int32/64/96/boolean 2018-03-17 16:05:38 -04:00
Colin Dellow
a3af16eb54 Row-filtering for other string ops 2018-03-17 15:28:51 -04:00
Colin Dellow
03a20a9432 LIKE row group filtering
~1.7s -> ~1.0s for the census data set on `LIKE 'Dawson %'`
2018-03-17 00:11:38 -04:00
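One way a prefix pattern such as `LIKE 'Dawson %'` can prune row groups is by comparing the literal prefix against each row group's min/max statistics. The sketch below is a guess at the general technique, not the extension's implementation; it assumes case-sensitive, code-point ordering, whereas SQLite's `LIKE` is case-insensitive for ASCII by default, so a real implementation would need to account for that.

```python
def may_contain_prefix(prefix: str, group_min: str, group_max: str) -> bool:
    """Return False only when no value in [group_min, group_max] can start with prefix.

    Truncating to len(prefix) preserves lexicographic order, so any value
    that starts with prefix forces min[:n] <= prefix <= max[:n].
    Assumption: comparisons are case-sensitive and ordered by code point.
    """
    n = len(prefix)
    return group_min[:n] <= prefix <= group_max[:n]


# Hypothetical min/max statistics for three row groups of a city column.
skip_early = may_contain_prefix("Dawson ", "Abbotsford", "Calgary")
keep_middle = may_contain_prefix("Dawson ", "Calgary", "Edmonton")
skip_exact = may_contain_prefix("Dawson ", "Dawson", "Dawson")
print(skip_early, keep_middle, skip_exact)
```

The check is conservative: a `True` result only means the row group cannot be ruled out, so its rows still need to be filtered individually.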
Colin Dellow
753a490687 Tests for blobs 2018-03-16 23:53:08 -04:00