Commit Graph

62 Commits

Author SHA1 Message Date
Dan Forsberg 723b3d6ac7 Switch to Arrow and Ubuntu packages 2020-10-19 12:48:20 +01:00
Colin Dellow d3128498b1 Escape column names in CREATE
Fixes #33

Add test case with a hyphen and an embedded quotation mark
2018-07-08 10:37:18 -04:00
Colin Dellow a52ad9c20c simplify error checking 2018-07-05 22:09:22 -04:00
Colin Dellow e83a21db15 remove match/regexp support
These are only enabled if we implement `xFindFunction`, which we don't
2018-07-05 21:49:28 -04:00
Colin Dellow 80adb804b4 Don't include debug fns in non debug builds 2018-07-05 19:02:24 -04:00
Colin Dellow 373616ad1e Don't try to optimize IsNot
Doesn't handle NULLs correctly, will open separate ticket
for it. Fixes the IS NOT case of #26
2018-07-04 18:59:16 -04:00
Colin Dellow 0bdcc9895e All-in-one build command
`./make-linux` clones and builds:

- arrow
- brotli
- lz4
- parquet
- snappy
- zlib
- zstd
- this project

as a statically linked binary. Two Boost libs are still pulled in as
shared libs, should probably fix that, too, for ultimate portability.
2018-06-24 21:11:07 -04:00
Colin Dellow ec6e970bbc Fix `order by rowid` to apply w/o clause
Fixes #12, first screen of datasette is fast now
2018-06-24 15:20:06 -04:00
Colin Dellow 5b59ba02fe Make ORDER BY ROWID fast
Fixes #11
2018-06-24 15:07:27 -04:00
Colin Dellow b774973852 Avoid row filter check when no constraints
The function call overhead is expensive!

This makes count(*) on the census data 175ms instead
of 225ms, while not significantly impacting other use cases.
2018-06-24 14:51:54 -04:00
Colin Dellow 16cdd70f2b Short-circuit row group evaluation
We can avoid eagerly computing bitmasks for other constraints this way.

Possible future work - order the constraints such that we evaluate the
one that is cheapest/most likely to prune a row group first.

This reduces the cyclist query from ~65ms to ~60ms
2018-06-24 11:08:56 -04:00
Colin Dellow e1a86954e5 Revert "Don't eagerly evaluate constraints"
This reverts commit cbde3c73b6.

This regresses:

```
WITH inputs AS (
  SELECT
    geo_name,
    CASE WHEN profile_id = 1930 THEN 'total' ELSE 'cyclist' END AS mode,
    female,
    male
  FROM census
  WHERE profile_id IN ( '1930', '1935') AND
    csd_type_name = 'CY' AND
    geo_name IN ('Victoria', 'Dawson Creek', 'Kitchener')
)
SELECT
  total.geo_name,
  cyclist.male,
  cyclist.female,
  100.0 * cyclist.male / total.male,
  100.0 * cyclist.female / total.female
FROM inputs AS total
JOIN inputs AS cyclist USING (geo_name)
WHERE total.mode = 'total' AND cyclist.mode = 'cyclist';
```

while improving:

```
select count(*) from census where geo_name in ('Dawson Creek', 'Kitchener', 'Victoria') and csd_type_name = 'CY' and profile_id = '1930';
```

which seems like a bad tradeoff.
2018-06-23 20:48:39 -04:00
Colin Dellow 603153c36c avoid looking up physical type 2018-06-23 20:42:38 -04:00
Colin Dellow cbde3c73b6 Don't eagerly evaluate constraints
...to avoid decompressing columns when we know from previous
columns that the row can't match.

Fixes #10
2018-06-23 20:31:03 -04:00
Colin Dellow d7c5002cee Move some code out of ensureColumn
Saves ~4% on the cold census needle query (~425ms -> ~405ms)
2018-06-23 19:10:23 -04:00
Colin Dellow b9c58bd97e persist row group clauses on EOF
...not on close. Fixes #9
2018-06-23 16:25:56 -04:00
Colin Dellow 6d4be61261 tweak Makefile 2018-06-23 16:13:18 -04:00
Colin Dellow d3ab5ff3e7 Cache clauses -> row group mapping
Create a shadow table. For `stats`, it'd be `_stats_rowgroups`.

It contains three columns:

- the clause (eg `city = 'Dawson Creek'`)
- the initial estimate, as a bitmap of rowgroups based on stats
- the actual observed rowgroups, as a bitmap

This papers over poorly sorted parquet files, at the cost of some disk
space. It makes interactive queries much more natural -- drilldown style
queries are much faster, as they can leverage work done by previous
queries.

eg 'SELECT * FROM stats WHERE city = 'Dawson Creek' and question_id >= 1935 and question_id <= 1940`
takes ~584ms on first run, but 9ms on subsequent runs.

We only create entries when the estimates don't match the actual
results.

Fixes #6
2018-03-24 23:57:15 -04:00
Colin Dellow 51d0f27a68 don't segfault on low memory
Fixes #8
2018-03-24 12:48:29 -04:00
Colin Dellow 599430b2f4 Add #ifdefs around printfs 2018-03-20 19:57:12 -04:00
Colin Dellow 5480de7fb6 Compile w/static linkages for parquet
Fixes #4. A stock Ubuntu 14.04 can now install sqlite3:amd64 and
libboost-all-dev, then use this module to read the test parquet file.
2018-03-20 19:06:39 -04:00
Colin Dellow 8bf890ab66 Fix incorrect row pruning for non-text BYTE_ARRAY 2018-03-18 19:43:09 -04:00
Colin Dellow e2af2a07a4 Make rowid start from 1, not 0
Unclear whether this is strictly required, but I'm going to start using
SQLite as an oracle, and it'll be simpler if our rowids match theirs.
2018-03-18 17:03:46 -04:00
Colin Dellow 1f3ffce560 Row group filtering for BYTE_ARRAY 2018-03-18 15:03:08 -04:00
Colin Dellow 7b302a0eb2 Bail on rowId constraint when non-int 2018-03-18 14:31:23 -04:00
Colin Dellow 3b557f7fb0 Add explicit test for file not found
...caching the metadata moved where ParquetTable did I/O,
which introduced a segfault on not found
2018-03-18 11:58:23 -04:00
Colin Dellow 4cbde9fc09 Row filtering for doubles 2018-03-17 16:09:57 -04:00
Colin Dellow 86e09b111e Add row filtering for int32/64/96/boolean 2018-03-17 16:05:38 -04:00
Colin Dellow a3af16eb54 Row-filtering for other string ops 2018-03-17 15:28:51 -04:00
Colin Dellow 03a20a9432 LIKE row group filtering
~1.7s -> ~1.0s for the census data set on `LIKE 'Dawson %'`
2018-03-17 00:11:38 -04:00
Colin Dellow 01e8ffaba7 Row group filtering for double/float 2018-03-16 16:30:05 -04:00
Colin Dellow 9c22fd1f57 Row group filters for strings, int32/64/96, bools 2018-03-16 16:07:41 -04:00
Colin Dellow 1f4cebe2a6 Don't use accessors
This drops the `= 'Dawson Creek'` query from 210ms to 145ms.

Maybe inlining would have been an option here? I'm not familiar enough
with g++ to know. :(
2018-03-15 23:04:11 -04:00
Colin Dellow 8ba13f44d5 Remove unnecessary copy
Now the `== 'Dawson Creek'` query is ~210ms, which is approx the
same as a `count(*)` query. This seems maybe OK, since the row group
filter is only excluding 30% of records.
2018-03-15 22:10:45 -04:00
Colin Dellow f7f1ed03d1 add row filter for string ==
This gets the census `== 'Dawson Creek'` query down to ~410ms from
~650ms.

That still seems much slower than it should be. Am I accidentally
doing a copy? Now to go learn how to profile C++ code...
2018-03-15 21:37:52 -04:00
Colin Dellow 6648ff5968 add string == row group filter
For the statscan census set filtering on `== 'Dawson Creek'`, the query
goes from 980ms to 660ms.

This is expected, since the data isn't sorted by that column.

I'll try adding some scaffolding to do filtering at the row level, too.

We could also try unpacking the dictionary and testing the individual
values, although we may want some heuristics to decide whether it's
worth doing -- eg if < 10% of the rows have a unique value.

Ideally, this should be like a ~1ms query.
2018-03-15 20:40:21 -04:00
Colin Dellow dc431aee20 Dispatch row group filtering based on parquet type 2018-03-15 20:25:02 -04:00
Colin Dellow 92ba5f94e0 reuse FileMetaData
For the statscan dataset, parsing the file metadata takes ~30-40ms,
so stash it away for future re-use.
2018-03-15 19:57:38 -04:00
Colin Dellow 769060dbcb Add stub row group filters for text/int/dbl
Checkpointing to investigate why min/max stats for text aren't
present
2018-03-12 23:07:41 -04:00
Colin Dellow 110e3e3668 row group skipping for is [not] null queries 2018-03-12 21:09:00 -04:00
Colin Dellow 95748a5192 Remove bool from Constraint 2018-03-12 20:50:30 -04:00
Colin Dellow acc15256ec Add rowgroup filtering for rowid 2018-03-12 20:42:50 -04:00
Colin Dellow 1f938a005d More tests cases to deal with affinity
I'm not sure how these manifest - whether SQLite retypes them based on
column affinity before we see them, or whether they're provided as is.
2018-03-11 19:18:44 -04:00
Colin Dellow 095b576cc2 Scaffolding for row group filters, tests
rowid is special since its column index is -1, so add
explicit tests around it
2018-03-11 15:44:51 -04:00
Colin Dellow 5559a7b563 Fix when last rowgroup is not same size as first
...change test data to use 99 rows, so that when we have
rowgroup size 10 we exercise this code.
2018-03-11 15:15:27 -04:00
Colin Dellow 830053c1fc Scaffolding for in-extension filtering
Supports IS NULL and IS NOT NULL checks
2018-03-11 13:58:10 -04:00
Colin Dellow d28ae86d15 Test unusable constraints 2018-03-10 13:38:34 -05:00
Colin Dellow 96fcafcd2f Add test cases 2018-03-10 13:25:13 -05:00
Colin Dellow 210f322a1c Code to pretty print constraints 2018-03-10 10:59:53 -05:00
Colin Dellow 2bc054a2cf Add crappy Makefile 2018-03-10 10:46:10 -05:00