`./make-linux` clones and builds the following, producing a statically linked binary:
- arrow
- brotli
- lz4
- parquet
- snappy
- zlib
- zstd
- this project
Two Boost libraries are still pulled in as shared libs; that should probably
be fixed, too, for full portability.
Function call overhead is expensive! This change makes count(*) on the census
data take 175ms instead of 225ms, while not significantly impacting other use
cases.
We can avoid eagerly computing bitmasks for the other constraints this way
(see the sketch below).
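A minimal sketch of that lazy evaluation in C++; the `RowGroupBitmap` type and
`LazyConstraintMask` wrapper are assumptions for illustration, not the
project's actual API:
```
#include <functional>
#include <optional>
#include <vector>

// Hypothetical sketch: compute a constraint's row-group bitmask only on first
// use, instead of eagerly for every constraint when the query starts.
using RowGroupBitmap = std::vector<bool>;  // one bit per row group (assumption)

class LazyConstraintMask {
public:
  explicit LazyConstraintMask(std::function<RowGroupBitmap()> compute)
      : compute_(std::move(compute)) {}

  // The first call pays for scanning row-group stats; later calls reuse it.
  const RowGroupBitmap& get() {
    if (!mask_) mask_ = compute_();
    return *mask_;
  }

private:
  std::function<RowGroupBitmap()> compute_;
  std::optional<RowGroupBitmap> mask_;
};
```
Constraints that never get consulted (e.g. because an earlier constraint
already pruned the row group) then never pay the computation cost at all.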
Possible future work: order the constraints so that the one that is
cheapest/most likely to prune a row group is evaluated first (see the sketch
below).
This reduces the cyclist query from ~65ms to ~60ms.
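A rough sketch of that ordering idea; the struct fields and the
cost-times-selectivity score are assumptions, not a measured model:
```
#include <algorithm>
#include <vector>

// Hypothetical: evaluate cheap, highly selective constraints first, so the
// expensive ones only run when earlier constraints failed to prune the group.
struct ConstraintInfo {
  double cost;         // estimated cost per evaluation (e.g. string vs int compare)
  double selectivity;  // estimated fraction of row groups that survive
};

void orderConstraints(std::vector<ConstraintInfo>& cs) {
  std::sort(cs.begin(), cs.end(),
            [](const ConstraintInfo& a, const ConstraintInfo& b) {
              // Lower score = cheaper and/or prunes more; evaluate it first.
              return a.cost * a.selectivity < b.cost * b.selectivity;
            });
}
```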
This reverts commit cbde3c73b6.
This regresses:
```
WITH inputs AS (
SELECT
geo_name,
CASE WHEN profile_id = '1930' THEN 'total' ELSE 'cyclist' END AS mode,
female,
male
FROM census
WHERE profile_id IN ('1930', '1935') AND
csd_type_name = 'CY' AND
geo_name IN ('Victoria', 'Dawson Creek', 'Kitchener')
)
SELECT
total.geo_name,
cyclist.male,
cyclist.female,
100.0 * cyclist.male / total.male,
100.0 * cyclist.female / total.female
FROM inputs AS total
JOIN inputs AS cyclist USING (geo_name)
WHERE total.mode = 'total' AND cyclist.mode = 'cyclist';
```
while improving:
```
SELECT COUNT(*)
FROM census
WHERE geo_name IN ('Dawson Creek', 'Kitchener', 'Victoria') AND
      csd_type_name = 'CY' AND
      profile_id = '1930';
```
which seems like a bad tradeoff.
Create a shadow table. For `stats`, it'd be `_stats_rowgroups`.
It contains three columns (see the sketch after this list):
- the clause (eg `city = 'Dawson Creek'`)
- the initial estimate, as a bitmap of rowgroups based on stats
- the actual observed rowgroups, as a bitmap
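A hypothetical sketch of that schema via the SQLite C API; the column names
(`clause`, `estimate`, `actual`) are assumptions based on the list above:
```
#include <sqlite3.h>

// Hypothetical schema for the shadow table; names are assumptions.
static const char* kCreateShadowTable =
    "CREATE TABLE IF NOT EXISTS _stats_rowgroups ("
    "  clause   TEXT PRIMARY KEY,"  // e.g. "city = 'Dawson Creek'"
    "  estimate BLOB,"              // bitmap of row groups implied by stats
    "  actual   BLOB"               // bitmap of row groups actually touched
    ");";

int createShadowTable(sqlite3* db) {
  char* err = nullptr;
  int rc = sqlite3_exec(db, kCreateShadowTable, nullptr, nullptr, &err);
  if (err) sqlite3_free(err);
  return rc;
}
```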
This papers over poorly sorted parquet files, at the cost of some disk
space. It makes interactive queries much more natural -- drilldown style
queries are much faster, as they can leverage work done by previous
queries.
eg `SELECT * FROM stats WHERE city = 'Dawson Creek' AND question_id >= 1935 AND question_id <= 1940`
takes ~584ms on the first run, but 9ms on subsequent runs.
We only create entries when the estimates don't match the actual
results.
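A sketch of that guard, again with assumed names and an assumed
serialized-bitmap representation:
```
#include <sqlite3.h>
#include <string>
#include <vector>

using SerializedBitmap = std::vector<unsigned char>;  // assumed representation

// Hypothetical: persist a shadow row only when the observed bitmap differs
// from the stats-based estimate, so well-sorted files add no rows at all.
void maybeRecord(sqlite3* db, const std::string& clause,
                 const SerializedBitmap& estimate,
                 const SerializedBitmap& actual) {
  if (estimate == actual) return;
  sqlite3_stmt* stmt = nullptr;
  sqlite3_prepare_v2(db,
      "INSERT OR REPLACE INTO _stats_rowgroups (clause, estimate, actual) "
      "VALUES (?, ?, ?)",
      -1, &stmt, nullptr);
  sqlite3_bind_text(stmt, 1, clause.c_str(), -1, SQLITE_TRANSIENT);
  sqlite3_bind_blob(stmt, 2, estimate.data(), (int)estimate.size(), SQLITE_TRANSIENT);
  sqlite3_bind_blob(stmt, 3, actual.data(), (int)actual.size(), SQLITE_TRANSIENT);
  sqlite3_step(stmt);
  sqlite3_finalize(stmt);
}
```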
Fixes #6
Very simplistic - select M fields, filter on N fields, with a slight bias
toward using values of the same type as the field being compared against (see
the sketch below).
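A sketch of what such a generator might look like; the table name, column
model, field counts, and bias probability are all assumptions:
```
#include <random>
#include <string>
#include <vector>

struct Column { std::string name; bool isText; };

// Literal of the requested type: quoted for text columns, bare for numeric.
std::string literalFor(bool asText, std::mt19937& rng) {
  std::uniform_int_distribution<int> d(0, 99);
  int v = d(rng);
  return asText ? "'" + std::to_string(v) + "'" : std::to_string(v);
}

// Hypothetical generator: select M fields, filter on N fields, and usually
// (but not always) compare a column against a literal of its own type.
std::string generateQuery(const std::vector<Column>& cols, std::mt19937& rng) {
  std::uniform_int_distribution<size_t> pick(0, cols.size() - 1);
  std::uniform_int_distribution<int> howMany(1, 3);  // M and N (assumed range)
  std::bernoulli_distribution sameType(0.8);         // the "slight bias"

  std::string sql = "SELECT ";
  for (int i = 0, m = howMany(rng); i < m; i++)
    sql += (i ? ", " : "") + cols[pick(rng)].name;
  sql += " FROM nulls1 WHERE ";  // table name is an assumption
  for (int i = 0, n = howMany(rng); i < n; i++) {
    const Column& c = cols[pick(rng)];
    bool asText = sameType(rng) ? c.isText : !c.isText;
    sql += (i ? " AND " : "") + c.name + " >= " + literalFor(asText, rng);
  }
  return sql + ";";
}
```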
No segfaults yet, but there is one test case that generates differing output
when run against `nulls` and `nulls1`:
```
select rowid from nulls1 where binary_9 >= '56' and ts_5 < 496886400000;
```
Regularize the parquets - `nulls` and `no_nulls` each come in 3 variants,
with 1, 10 and 99 rows per rowgroup.
All test queries are written against `nullsA` and `no_nullsA`.
The next commit will introduce a tool to expand these template queries to
run against the actual tables (a sketch of the expansion follows).
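A sketch of what that expansion might look like; the concrete table-name
convention (`nulls1`, `no_nulls10`, ...) is an assumption:
```
#include <string>
#include <vector>

// Hypothetical expansion: rewrite the template placeholders (nullsA,
// no_nullsA) against each concrete variant. Replacing "nullsA" alone also
// handles "no_nullsA", since the former is a suffix of the latter.
std::vector<std::string> expandTemplate(const std::string& tmpl) {
  std::vector<std::string> out;
  for (const char* suffix : {"1", "10", "99"}) {  // rows per rowgroup
    std::string q = tmpl;
    const std::string from = "nullsA";
    const std::string to = std::string("nulls") + suffix;  // assumed naming
    for (size_t pos = 0; (pos = q.find(from, pos)) != std::string::npos;
         pos += to.size())
      q.replace(pos, from.size(), to);
    out.push_back(q);
  }
  return out;
}
```
Under that naming assumption, expanding `select rowid from nullsA ...` would
yield three queries, against `nulls1`, `nulls10` and `nulls99`.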