Commit Graph

139 Commits

Author SHA1 Message Date
Colin Dellow 263a6af7ec Use Arrow's compression libraries
Fixes #27
2018-06-26 08:17:18 -04:00
Colin Dellow 129ff4e694 Merge pull request #25 from evansd/makefile-fixes
Makefile fixes
2018-06-25 13:54:29 -04:00
David Evans b7da04433b Include header locations we need 2018-06-25 18:20:24 +01:00
David Evans ab87b13b75 Reverse prereqs order to get build to work 2018-06-25 18:20:04 +01:00
Colin Dellow bc0be71546 Add brotli/snappy/gzip test files
`test/test-supported` verifies they can be opened
2018-06-25 08:32:36 -04:00
Colin Dellow 0bdcc9895e All-in-one build command
`./make-linux` clones and builds:

- arrow
- brotli
- lz4
- parquet
- snappy
- zlib
- zstd
- this project

as a statically linked binary. Two Boost libs are still pulled in as
shared libs; that should probably be fixed too, for ultimate portability.
2018-06-24 21:11:07 -04:00
Colin Dellow ec6e970bbc Fix `order by rowid` to apply even without a WHERE clause
Fixes #12; the first screen of datasette is fast now
2018-06-24 15:20:06 -04:00
Colin Dellow 5b59ba02fe Make ORDER BY ROWID fast
Fixes #11
2018-06-24 15:07:27 -04:00
Colin Dellow b774973852 Avoid row filter check when no constraints
The function call overhead is expensive!

This makes count(*) on the census data 175ms instead
of 225ms, while not significantly impacting other use cases.
2018-06-24 14:51:54 -04:00
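
A minimal sketch of the idea in the commit above, using hypothetical names rather than the project's actual types: when the query pushed down no constraints (e.g. a bare `count(*)`), the per-row filter call is skipped entirely, avoiding the call overhead the commit measures.

```
#include <functional>
#include <vector>

// Hypothetical stand-ins for the real cursor/constraint machinery.
struct Row { /* current row's column values */ };
using Constraint = std::function<bool(const Row&)>;

// Advance the iterator to the next row that satisfies every pushed-down
// constraint. When no constraints were pushed down (e.g. a bare count(*)),
// skip the per-row filter logic entirely.
void advanceToNextMatch(std::vector<Row>::const_iterator& it,
                        std::vector<Row>::const_iterator end,
                        const std::vector<Constraint>& constraints) {
  if (it == end) return;
  ++it;
  if (constraints.empty()) return;  // nothing to filter on: no per-row call
  for (; it != end; ++it) {
    bool ok = true;
    for (const auto& c : constraints)
      if (!c(*it)) { ok = false; break; }
    if (ok) return;
  }
}
```
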
Colin Dellow 84a22e6e77 link to blog 2018-06-24 11:39:44 -04:00
Colin Dellow 16cdd70f2b Short-circuit row group evaluation
We can avoid eagerly computing bitmasks for other constraints this way.

Possible future work - order the constraints such that we evaluate the
one that is cheapest/most likely to prune a row group first.

This reduces the cyclist query from ~65ms to ~60ms
2018-06-24 11:08:56 -04:00
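
A minimal sketch of the row-group short-circuit described above, with hypothetical names standing in for the real statistics and constraint types: as soon as one constraint proves the group can't match, the remaining constraints (and their bitmasks) are never computed.

```
#include <vector>

// Hypothetical stand-ins; the real code consults parquet row-group statistics.
struct RowGroupStats { /* per-column min/max, null counts, ... */ };

struct Constraint {
  // Returns false only when the statistics prove no row in the group matches.
  bool mightMatch(const RowGroupStats&) const { return true; }
};

// Short-circuit: stop at the first constraint that excludes the row group,
// instead of eagerly computing a result for every constraint.
bool rowGroupMightMatch(const RowGroupStats& stats,
                        const std::vector<Constraint>& constraints) {
  for (const auto& c : constraints) {
    if (!c.mightMatch(stats))
      return false;  // later constraints are never evaluated for this group
  }
  return true;
}
```
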
Colin Dellow fd87c44ccd Add link to csv2parquet 2018-06-23 23:58:13 -04:00
Colin Dellow e1a86954e5 Revert "Don't eagerly evaluate constraints"
This reverts commit cbde3c73b6.

This regresses:

```
WITH inputs AS (
  SELECT
    geo_name,
    CASE WHEN profile_id = 1930 THEN 'total' ELSE 'cyclist' END AS mode,
    female,
    male
  FROM census
  WHERE profile_id IN ( '1930', '1935') AND
    csd_type_name = 'CY' AND
    geo_name IN ('Victoria', 'Dawson Creek', 'Kitchener')
)
SELECT
  total.geo_name,
  cyclist.male,
  cyclist.female,
  100.0 * cyclist.male / total.male,
  100.0 * cyclist.female / total.female
FROM inputs AS total
JOIN inputs AS cyclist USING (geo_name)
WHERE total.mode = 'total' AND cyclist.mode = 'cyclist';
```

while improving:

```
select count(*) from census where geo_name in ('Dawson Creek', 'Kitchener', 'Victoria') and csd_type_name = 'CY' and profile_id = '1930';
```

which seems like a bad tradeoff.
2018-06-23 20:48:39 -04:00
Colin Dellow 603153c36c avoid looking up physical type 2018-06-23 20:42:38 -04:00
Colin Dellow cbde3c73b6 Don't eagerly evaluate constraints
...to avoid decompressing columns when we know from previous
columns that the row can't match.

Fixes #10
2018-06-23 20:31:03 -04:00
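
A rough sketch of the lazy evaluation this commit describes (hypothetical names; note this approach was reverted further up the log): constraints are checked column by column, so once one fails, the remaining columns are never read and therefore never decompressed.

```
#include <vector>

// Hypothetical stand-ins: each constraint targets one column, and reading a
// column's value for the current row may force decompressing that column.
struct ColumnReader {
  int currentValue() const { return 0; }  // decompresses on demand
};

struct ColumnConstraint {
  const ColumnReader* column;
  bool matches(int /*value*/) const { return true; }
};

// Check constraints lazily: the first failure stops evaluation, so the
// remaining columns are never touched and stay compressed.
bool rowMatches(const std::vector<ColumnConstraint>& constraints) {
  for (const auto& c : constraints) {
    if (!c.matches(c.column->currentValue()))
      return false;
  }
  return true;
}
```
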
Colin Dellow d7c5002cee Move some code out of ensureColumn
Saves ~4% on the cold census needle query (~425ms -> ~405ms)
2018-06-23 19:10:23 -04:00
Colin Dellow b9c58bd97e persist row group clauses on EOF
...not on close. Fixes #9
2018-06-23 16:25:56 -04:00
Colin Dellow 6d4be61261 tweak Makefile 2018-06-23 16:13:18 -04:00
Colin Dellow 596496c9cb rejig README 2018-03-25 00:07:56 -04:00
Colin Dellow d3ab5ff3e7 Cache clauses -> row group mapping
Create a shadow table. For `stats`, it'd be `_stats_rowgroups`.

It contains three columns:

- the clause (eg `city = 'Dawson Creek'`)
- the initial estimate, as a bitmap of rowgroups based on stats
- the actual observed rowgroups, as a bitmap

This papers over poorly sorted parquet files, at the cost of some disk
space. It makes interactive queries much more natural -- drilldown style
queries are much faster, as they can leverage work done by previous
queries.

eg `SELECT * FROM stats WHERE city = 'Dawson Creek' and question_id >= 1935 and question_id <= 1940`
takes ~584ms on first run, but 9ms on subsequent runs.

We only create entries when the estimates don't match the actual
results.

Fixes #6
2018-03-24 23:57:15 -04:00
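
A minimal sketch of the shadow table described above, with hypothetical column names (the commit only specifies the table name pattern and the roles of the three columns):

```
#include <sqlite3.h>

// Hypothetical DDL for the shadow table: one row per clause, with the
// estimated and observed row-group sets stored as bitmaps. Rows are only
// written when the estimate and the observed set differ.
static const char* kCreateShadowTable =
    "CREATE TABLE IF NOT EXISTS _stats_rowgroups ("
    "  clause TEXT PRIMARY KEY,"   /* e.g. city = 'Dawson Creek' */
    "  estimate BLOB,"             /* bitmap of row groups, from parquet stats */
    "  actual BLOB"                /* bitmap of row groups actually observed */
    ");";

int createShadowTable(sqlite3* db) {
  return sqlite3_exec(db, kCreateShadowTable, nullptr, nullptr, nullptr);
}
```
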
Colin Dellow d2c736f25a Add LIMIT/OFFSET to random queries 2018-03-24 19:02:30 -04:00
Colin Dellow cafd087113 Update README 2018-03-24 12:49:03 -04:00
Colin Dellow 51d0f27a68 don't segfault on low memory
Fixes #8
2018-03-24 12:48:29 -04:00
Colin Dellow 6fa7bc3d0b Add harness for low memory testing 2018-03-24 11:27:06 -04:00
Colin Dellow 599430b2f4 Add #ifdefs around printfs 2018-03-20 19:57:12 -04:00
Colin Dellow 5480de7fb6 Compile w/ static linkage for parquet
Fixes #4. A stock Ubuntu 14.04 can now install sqlite3:amd64 and
libboost-all-dev, then use this module to read the test parquet file.
2018-03-20 19:06:39 -04:00
Colin Dellow 8bf890ab66 Fix incorrect row pruning for non-text BYTE_ARRAY 2018-03-18 19:43:09 -04:00
Colin Dellow 893e4c63f5 Add testcase generator
Very simplistic - selects M fields, filters on N fields, with a slight bias
toward using values of the same type as the field being compared against.

No segfaults yet, but there is one test case that generates differing output
when run against `nulls` and `nulls1`:

```
select rowid from nulls1 where binary_9 >= '56' and ts_5 < 496886400000;
```
2018-03-18 19:11:26 -04:00
Colin Dellow 045e17da34 Note about 64-bit sqlite 2018-03-18 18:25:08 -04:00
Colin Dellow b0c7b229dd Create queries from templates if needed 2018-03-18 17:50:39 -04:00
Colin Dellow 7f2042742b Also compare queries against SQLite itself 2018-03-18 17:49:12 -04:00
Colin Dellow e2af2a07a4 Make rowid start from 1, not 0
Unclear whether this is strictly required, but I'm going to start using
SQLite as an oracle, and it'll be simpler if our rowids match theirs.
2018-03-18 17:03:46 -04:00
Colin Dellow d430a45e41 Update README 2018-03-18 15:08:02 -04:00
Colin Dellow 1f3ffce560 Row group filtering for BYTE_ARRAY 2018-03-18 15:03:08 -04:00
Colin Dellow 7b302a0eb2 Bail on rowId constraint when non-int 2018-03-18 14:31:23 -04:00
Colin Dellow 078754467e Generate queries from templates
Huzzah, a bunch of failures have appeared.
2018-03-18 14:28:31 -04:00
Colin Dellow e3f0dff083 Move queries/* to templates 2018-03-18 13:28:56 -04:00
Colin Dellow 65ea1b2f61 Rewrite tests for automatic generation
Regularize the parquets - nulls and nonulls each come in 3 variants,
with 1, 10 and 99 rows per rowgroup.

All test queries are written against nullsA, no_nullsA.

Next commit will introduce a tool to expand these template queries to
go against the actual tables.
2018-03-18 13:11:29 -04:00
Colin Dellow 3b557f7fb0 Add explicit test for file not found
...caching the metadata moved where ParquetTable did I/O,
which introduced a segfault when the file was not found
2018-03-18 11:58:23 -04:00
Colin Dellow 4cbde9fc09 Row filtering for doubles 2018-03-17 16:09:57 -04:00
Colin Dellow 86e09b111e Add row filtering for int32/64/96/boolean 2018-03-17 16:05:38 -04:00
Colin Dellow a3af16eb54 Row-filtering for other string ops 2018-03-17 15:28:51 -04:00
Colin Dellow 03a20a9432 LIKE row group filtering
~1.7s -> ~1.0s for the census data set on `LIKE 'Dawson %'`
2018-03-17 00:11:38 -04:00
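
A minimal sketch of prefix-based pruning for `LIKE 'Dawson %'`-style constraints, with hypothetical names (and ignoring LIKE's default case-insensitivity for brevity): the literal prefix is compared against the row group's min/max string statistics, so groups whose value range can't contain the prefix are skipped.

```
#include <string>

// Hypothetical helper: can a row group possibly contain a value matching
// `col LIKE 'prefix%'`, given the group's min/max string statistics?
bool rowGroupMightMatchLikePrefix(const std::string& prefix,
                                  const std::string& minValue,
                                  const std::string& maxValue) {
  if (prefix.empty())
    return true;  // no literal prefix to prune on
  // Every match starts with `prefix`, so it compares >= prefix; if even the
  // group's max is below that, nothing in the group can match.
  if (maxValue < prefix)
    return false;
  // Likewise, if the group's min already sorts above every string that
  // starts with `prefix`, the group can't match either.
  if (minValue.compare(0, prefix.size(), prefix) > 0)
    return false;
  return true;
}
```
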
Colin Dellow 753a490687 Tests for blobs 2018-03-16 23:53:08 -04:00
Colin Dellow 01e8ffaba7 Row group filtering for double/float 2018-03-16 16:30:05 -04:00
Colin Dellow 9c22fd1f57 Row group filters for strings, int32/64/96, bools 2018-03-16 16:07:41 -04:00
Colin Dellow cbf388698b BOOL and INT96 tests 2018-03-16 16:02:11 -04:00
Colin Dellow e87f0d0f68 Note about versions 2018-03-16 00:19:25 -04:00
Colin Dellow 1f4cebe2a6 Don't use accessors
This drops the `= 'Dawson Creek'` query from 210ms to 145ms.

Maybe inlining would have been an option here? I'm not familiar enough
with g++ to know. :(
2018-03-15 23:04:11 -04:00
Colin Dellow 8ba13f44d5 Remove unnecessary copy
Now the `== 'Dawson Creek'` query is ~210ms, which is approx the
same as a `count(*)` query. This seems maybe OK, since the row group
filter is only excluding 30% of records.
2018-03-15 22:10:45 -04:00