Commit Graph

134 Commits

Author SHA1 Message Date
Colin Dellow 0bdcc9895e All-in-one build command
`./make-linux` clones and builds:

- arrow
- brotli
- lz4
- parquet
- snappy
- zlib
- zstd
- this project

as a statically linked binary. Two Boost libs are still pulled in as
shared libs; that should probably be fixed, too, for full portability.
2018-06-24 21:11:07 -04:00
Colin Dellow ec6e970bbc Fix `order by rowid` to apply w/o clause
Fixes #12; the first screen of datasette is fast now
2018-06-24 15:20:06 -04:00
Colin Dellow 5b59ba02fe Make ORDER BY ROWID fast
Fixes #11
2018-06-24 15:07:27 -04:00
Colin Dellow b774973852 Avoid row filter check when no constraints
The function call overhead is expensive!

This makes `count(*)` on the census data take 175ms instead
of 225ms, while not significantly impacting other use cases.
2018-06-24 14:51:54 -04:00
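A minimal sketch of the pattern described above, with hypothetical names (the real cursor code differs): check once for an empty constraint list instead of paying a function call per row.

```
#include <functional>
#include <vector>

struct Constraint { /* column, op, value, ... */ };

// Hypothetical scan loop: when there is no WHERE clause (e.g. a bare
// count(*)), skip the per-row filter call entirely.
long countMatches(const std::vector<Constraint>& constraints,
                  long rowCount,
                  const std::function<bool(long)>& rowMatches) {
  if (constraints.empty()) {
    // Nothing to filter on: every row matches, no per-row call needed.
    return rowCount;
  }
  long matches = 0;
  for (long row = 0; row < rowCount; ++row) {
    if (rowMatches(row)) ++matches;
  }
  return matches;
}
```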
Colin Dellow 84a22e6e77 link to blog 2018-06-24 11:39:44 -04:00
Colin Dellow 16cdd70f2b Short-circuit row group evaluation
We can avoid eagerly computing bitmasks for other constraints this way.

Possible future work - order the constraints such that we evaluate the
one that is cheapest/most likely to prune a row group first.

This reduces the cyclist query from ~65ms to ~60ms
2018-06-24 11:08:56 -04:00
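A sketch of the short-circuit idea, with hypothetical types (the real code evaluates SQLite constraints against parquet row-group statistics): stop as soon as one constraint rules a row group out, so the bitmasks for the remaining constraints are never computed.

```
#include <functional>
#include <vector>

struct Constraint { /* column, op, value, ... */ };

// `mightMatch` checks a single constraint against a row group's
// statistics. The first constraint that excludes the row group ends the
// loop, so later (possibly more expensive) constraints are skipped.
bool rowGroupMightMatch(
    const std::vector<Constraint>& constraints,
    int rowGroup,
    const std::function<bool(const Constraint&, int)>& mightMatch) {
  for (const Constraint& c : constraints) {
    if (!mightMatch(c, rowGroup)) return false;  // short-circuit
  }
  return true;
}
```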
Colin Dellow fd87c44ccd Add link to csv2parquet 2018-06-23 23:58:13 -04:00
Colin Dellow e1a86954e5 Revert "Don't eagerly evaluate constraints"
This reverts commit cbde3c73b6.

This regresses:

```
WITH inputs AS (
  SELECT
    geo_name,
    CASE WHEN profile_id = 1930 THEN 'total' ELSE 'cyclist' END AS mode,
    female,
    male
  FROM census
  WHERE profile_id IN ( '1930', '1935') AND
    csd_type_name = 'CY' AND
    geo_name IN ('Victoria', 'Dawson Creek', 'Kitchener')
)
SELECT
  total.geo_name,
  cyclist.male,
  cyclist.female,
  100.0 * cyclist.male / total.male,
  100.0 * cyclist.female / total.female
FROM inputs AS total
JOIN inputs AS cyclist USING (geo_name)
WHERE total.mode = 'total' AND cyclist.mode = 'cyclist';
```

while improving:

```
select count(*) from census where geo_name in ('Dawson Creek', 'Kitchener', 'Victoria') and csd_type_name = 'CY' and profile_id = '1930';
```

which seems like a bad tradeoff.
2018-06-23 20:48:39 -04:00
Colin Dellow 603153c36c avoid looking up physical type 2018-06-23 20:42:38 -04:00
Colin Dellow cbde3c73b6 Don't eagerly evaluate constraints
...to avoid decompressing columns when we know from previous
columns that the row can't match.

Fixes #10
2018-06-23 20:31:03 -04:00
Colin Dellow d7c5002cee Move some code out of ensureColumn
Saves ~4% on the cold census needle query (~425ms -> ~405ms)
2018-06-23 19:10:23 -04:00
Colin Dellow b9c58bd97e persist row group clauses on EOF
...not on close. Fixes #9
2018-06-23 16:25:56 -04:00
Colin Dellow 6d4be61261 tweak Makefile 2018-06-23 16:13:18 -04:00
Colin Dellow 596496c9cb rejig README 2018-03-25 00:07:56 -04:00
Colin Dellow d3ab5ff3e7 Cache clauses -> row group mapping
Create a shadow table. For `stats`, it'd be `_stats_rowgroups`.

It contains three columns:

- the clause (eg `city = 'Dawson Creek'`)
- the initial estimate, as a bitmap of rowgroups based on stats
- the actual observed rowgroups, as a bitmap

This papers over poorly sorted parquet files, at the cost of some disk
space. It makes interactive queries much more natural -- drilldown style
queries are much faster, as they can leverage work done by previous
queries.

eg `SELECT * FROM stats WHERE city = 'Dawson Creek' and question_id >= 1935 and question_id <= 1940`
takes ~584ms on first run, but 9ms on subsequent runs.

We only create entries when the estimates don't match the actual
results.

Fixes #6
2018-03-24 23:57:15 -04:00
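A hedged sketch of what creating and populating such a shadow table could look like; the column names and bitmap encoding here are guesses based on the description above, not the actual schema.

```
#include <sqlite3.h>
#include <string>

// Create the shadow table for a virtual table named `stats`.
// Column names are illustrative; the commit only says it stores the
// clause text plus the estimated and observed row-group bitmaps.
int createRowGroupCache(sqlite3* db) {
  const char* ddl =
      "CREATE TABLE IF NOT EXISTS _stats_rowgroups ("
      "  clause TEXT PRIMARY KEY,"
      "  estimate BLOB,"
      "  actual BLOB)";
  return sqlite3_exec(db, ddl, nullptr, nullptr, nullptr);
}

// Record the observed row groups for a clause, but only when the
// estimate and the observation differ (as described above).
int persistClause(sqlite3* db, const std::string& clause,
                  const std::string& estimate, const std::string& actual) {
  if (estimate == actual) return SQLITE_OK;
  sqlite3_stmt* stmt = nullptr;
  int rc = sqlite3_prepare_v2(
      db, "INSERT OR REPLACE INTO _stats_rowgroups VALUES (?, ?, ?)",
      -1, &stmt, nullptr);
  if (rc != SQLITE_OK) return rc;
  sqlite3_bind_text(stmt, 1, clause.c_str(), -1, SQLITE_TRANSIENT);
  sqlite3_bind_blob(stmt, 2, estimate.data(), (int)estimate.size(), SQLITE_TRANSIENT);
  sqlite3_bind_blob(stmt, 3, actual.data(), (int)actual.size(), SQLITE_TRANSIENT);
  rc = sqlite3_step(stmt);
  sqlite3_finalize(stmt);
  return rc == SQLITE_DONE ? SQLITE_OK : rc;
}
```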
Colin Dellow d2c736f25a Add LIMIT/OFFSET to random queries 2018-03-24 19:02:30 -04:00
Colin Dellow cafd087113 Update README 2018-03-24 12:49:03 -04:00
Colin Dellow 51d0f27a68 don't segfault on low memory
Fixes #8
2018-03-24 12:48:29 -04:00
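Not the actual fix, but a sketch of the general pattern: check allocation results and return SQLITE_NOMEM instead of dereferencing a null pointer later.

```
#include <sqlite3.h>
#include <cstring>

// Hypothetical helper: allocate a buffer via SQLite's allocator and fail
// cleanly when memory is exhausted.
int allocateBuffer(void** out, sqlite3_uint64 bytes) {
  void* p = sqlite3_malloc64(bytes);
  if (p == nullptr) {
    *out = nullptr;
    return SQLITE_NOMEM;  // caller propagates this up to SQLite
  }
  std::memset(p, 0, bytes);
  *out = p;
  return SQLITE_OK;
}
```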
Colin Dellow 6fa7bc3d0b Add harness for low memory testing 2018-03-24 11:27:06 -04:00
Colin Dellow 599430b2f4 Add #ifdefs around printfs 2018-03-20 19:57:12 -04:00
Colin Dellow 5480de7fb6 Compile w/static linkages for parquet
Fixes #4. A stock Ubuntu 14.04 can now install sqlite3:amd64 and
libboost-all-dev, then use this module to read the test parquet file.
2018-03-20 19:06:39 -04:00
Colin Dellow 8bf890ab66 Fix incorrect row pruning for non-text BYTE_ARRAY 2018-03-18 19:43:09 -04:00
Colin Dellow 893e4c63f5 Add testcase generator
Very simplistic - selects M fields, filters on N fields, with a slight bias
toward using values of the same type as the field it's comparing against.

No segfaults yet, but there's one test case that generates differing output
when run against `nulls` and `nulls1`:

```
select rowid from nulls1 where binary_9 >= '56' and ts_5 < 496886400000;
```
2018-03-18 19:11:26 -04:00
Colin Dellow 045e17da34 Note about 64-bit sqlite 2018-03-18 18:25:08 -04:00
Colin Dellow b0c7b229dd Create queries from templates if needed 2018-03-18 17:50:39 -04:00
Colin Dellow 7f2042742b Also compare queries against SQLite itself 2018-03-18 17:49:12 -04:00
Colin Dellow e2af2a07a4 Make rowid start from 1, not 0
Unclear whether this is strictly required, but I'm going to start using
SQLite as an oracle, and it'll be simpler if our rowids match theirs.
2018-03-18 17:03:46 -04:00
Colin Dellow d430a45e41 Update README 2018-03-18 15:08:02 -04:00
Colin Dellow 1f3ffce560 Row group filtering for BYTE_ARRAY 2018-03-18 15:03:08 -04:00
Colin Dellow 7b302a0eb2 Bail on rowId constraint when non-int 2018-03-18 14:31:23 -04:00
Colin Dellow 078754467e Generate queries from templates
Huzzah, a bunch of failures have appeared.
2018-03-18 14:28:31 -04:00
Colin Dellow e3f0dff083 Move queries/* to templates 2018-03-18 13:28:56 -04:00
Colin Dellow 65ea1b2f61 Rewrite tests for automatic generation
Regularize the parquets - nulls and no_nulls each come in 3 variants,
with 1, 10 and 99 rows per rowgroup.

All test queries are written against nullsA, no_nullsA.

Next commit will introduce a tool to expand these template queries to
go against the actual tables.
2018-03-18 13:11:29 -04:00
Colin Dellow 3b557f7fb0 Add explicit test for file not found
...caching the metadata moved where ParquetTable did I/O,
which introduced a segfault when the file wasn't found
2018-03-18 11:58:23 -04:00
Colin Dellow 4cbde9fc09 Row filtering for doubles 2018-03-17 16:09:57 -04:00
Colin Dellow 86e09b111e Add row filtering for int32/64/96/boolean 2018-03-17 16:05:38 -04:00
Colin Dellow a3af16eb54 Row-filtering for other string ops 2018-03-17 15:28:51 -04:00
Colin Dellow 03a20a9432 LIKE row group filtering
~1.7s -> ~1.0s for the census data set on `LIKE 'Dawson %'`
2018-03-17 00:11:38 -04:00
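Presumably the trick is that a pattern like `'Dawson %'` has a literal prefix before the first wildcard, which can be compared against row-group min/max statistics. A sketch with hypothetical helpers (and ignoring SQLite's default case-insensitive LIKE):

```
#include <string>

// Literal prefix of a LIKE pattern: everything before the first
// wildcard. 'Dawson %' -> 'Dawson '.
std::string likePrefix(const std::string& pattern) {
  std::string prefix;
  for (char c : pattern) {
    if (c == '%' || c == '_') break;
    prefix.push_back(c);
  }
  return prefix;
}

// A row group [minValue, maxValue] can only contain matches if some
// string starting with the prefix lies in that range: the prefix must be
// <= maxValue, and minValue (truncated to the prefix length) must be
// <= the prefix.
bool rowGroupMightMatchLike(const std::string& pattern,
                            const std::string& minValue,
                            const std::string& maxValue) {
  const std::string prefix = likePrefix(pattern);
  if (prefix.empty()) return true;  // no usable prefix, can't prune
  return minValue.compare(0, prefix.size(), prefix) <= 0 &&
         prefix <= maxValue;
}
```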
Colin Dellow 753a490687 Tests for blobs 2018-03-16 23:53:08 -04:00
Colin Dellow 01e8ffaba7 Row group filtering for double/float 2018-03-16 16:30:05 -04:00
Colin Dellow 9c22fd1f57 Row group filters for strings, int32/64/96, bools 2018-03-16 16:07:41 -04:00
Colin Dellow cbf388698b BOOL and INT96 tests 2018-03-16 16:02:11 -04:00
Colin Dellow e87f0d0f68 Note about versions 2018-03-16 00:19:25 -04:00
Colin Dellow 1f4cebe2a6 Don't use accessors
This drops the `= 'Dawson Creek'` query from 210ms to 145ms.

Maybe inlining would have been an option here? I'm not familiar enough
with g++ to know. :(
2018-03-15 23:04:11 -04:00
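Presumably the change was along these lines (a generic sketch, not the project's actual classes): pull the data out from behind getters once, before the per-row hot loop, instead of calling accessors for every row.

```
#include <cstdint>
#include <cstring>
#include <vector>

struct StringColumn {
  const char* data;         // concatenated string bytes
  const uint32_t* offsets;  // offsets[i]..offsets[i+1] is row i's value
};

// Count rows equal to `needle`. Hoisting `data` and `offsets` into locals
// once, rather than reaching through accessor methods on every row, is
// the kind of change the commit describes.
size_t countEqual(const StringColumn& col, size_t rows,
                  const char* needle, size_t needleLen) {
  const char* data = col.data;
  const uint32_t* offsets = col.offsets;
  size_t matches = 0;
  for (size_t i = 0; i < rows; ++i) {
    const uint32_t start = offsets[i];
    const uint32_t len = offsets[i + 1] - start;
    if (len == needleLen && std::memcmp(data + start, needle, needleLen) == 0)
      ++matches;
  }
  return matches;
}
```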
Colin Dellow 8ba13f44d5 Remove unnecessary copy
Now the `== 'Dawson Creek'` query is ~210ms, which is approx the
same as a `count(*)` query. This seems maybe OK, since the row group
filter is only excluding 30% of records.
2018-03-15 22:10:45 -04:00
Colin Dellow f7f1ed03d1 add row filter for string ==
This gets the census `== 'Dawson Creek'` query down to ~410ms from
~650ms.

That still seems much slower than it should be. Am I accidentally
doing a copy? Now to go learn how to profile C++ code...
2018-03-15 21:37:52 -04:00
Colin Dellow 6648ff5968 add string == row group filter
For the statscan census set, a filter of `== 'Dawson Creek'` takes the query
from 980ms to 660ms.

This is expected, since the data isn't sorted by that column.

I'll try adding some scaffolding to do filtering at the row level, too.

We could also try unpacking the dictionary and testing the individual
values, although we may want some heuristics to decide whether it's
worth doing -- eg if < 10% of the rows have a unique value.

Ideally, this should be like a ~1ms query.
2018-03-15 20:40:21 -04:00
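A sketch of the stats-based check (field names are illustrative; the real code reads parquet column-chunk statistics): a row group can only contain rows equal to the target if the target falls inside the column's min/max range for that group.

```
#include <string>

// Min/max statistics for a string column in one row group.
struct StringStats {
  std::string minValue;
  std::string maxValue;
  bool hasMinMax;
};

// If the target lies outside [min, max], no row in the group can be
// equal, so the group (and its decompression) can be skipped entirely.
bool rowGroupMightContainEqual(const StringStats& stats,
                               const std::string& target) {
  if (!stats.hasMinMax) return true;  // no stats: must scan the group
  return stats.minValue <= target && target <= stats.maxValue;
}
```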
Colin Dellow dc431aee20 Dispatch row group filtering based on parquet type 2018-03-15 20:25:02 -04:00
Colin Dellow 92ba5f94e0 reuse FileMetaData
For the statscan dataset, parsing the file metadata takes ~30-40ms,
so stash it away for future re-use.
2018-03-15 19:57:38 -04:00
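A sketch of caching the parsed footer keyed by file path, assuming the parquet-cpp API of the time (`ParquetFileReader::OpenFile` and `metadata()`); the cache name, its use of a plain map, and the lack of locking or invalidation are simplifications.

```
#include <map>
#include <memory>
#include <string>

#include <parquet/api/reader.h>

// Process-wide cache: parse each file's metadata once and hand the shared
// FileMetaData to every subsequent open of the same path.
static std::map<std::string, std::shared_ptr<parquet::FileMetaData>> metadataCache;

std::shared_ptr<parquet::FileMetaData> getMetadata(const std::string& path) {
  auto it = metadataCache.find(path);
  if (it != metadataCache.end()) return it->second;

  std::unique_ptr<parquet::ParquetFileReader> reader =
      parquet::ParquetFileReader::OpenFile(path);
  std::shared_ptr<parquet::FileMetaData> md = reader->metadata();
  metadataCache[path] = md;
  return md;
}
```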
Colin Dellow 769060dbcb Add stub row group filters for text/int/dbl
Checkpointing to investigate why min/max stats for text aren't
present
2018-03-12 23:07:41 -04:00