mirror of https://github.com/cldellow/sqlite-parquet-vtable.git synced 2025-12-16 06:03:28 +00:00

Colin Dellow d3ab5ff3e7 Cache clauses -> row group mapping

Create a shadow table. For `stats`, it'd be `_stats_rowgroups`.

It contains three columns:

- the clause (eg `city = 'Dawson Creek'`)
- the initial estimate, as a bitmap of rowgroups based on stats
- the actual observed rowgroups, as a bitmap

This papers over poorly sorted parquet files, at the cost of some disk
space. It makes interactive queries much more natural -- drilldown style
queries are much faster, as they can leverage work done by previous
queries.

eg `SELECT * FROM stats WHERE city = 'Dawson Creek' and question_id >= 1935 and question_id <= 1940`
takes ~584ms on first run, but 9ms on subsequent runs.

We only create entries when the estimates don't match the actual
results.

Fixes #6

2018-03-24 23:57:15 -04:00

parquet

Cache clauses -> row group mapping

2018-03-24 23:57:15 -04:00

parquet-generator

Also compare queries against SQLite itself

2018-03-18 17:49:12 -04:00

tests

Cache clauses -> row group mapping

2018-03-24 23:57:15 -04:00

.gitignore

Add harness for low memory testing

2018-03-24 11:27:06 -04:00

build-sqlite

Add script to fetch+build sqlite

2018-03-02 18:46:40 -05:00

LICENSE

Initial commit

2018-03-02 18:37:08 -05:00

README.md

Cache clauses -> row group mapping

2018-03-24 23:57:15 -04:00

README.md

sqlite-parquet-vtable

A SQLite virtual table extension to expose Parquet files as SQL tables.

Building

1. Install parquet-cpp
   - Master appears to be broken for text row group stats; see https://github.com/cldellow/sqlite-parquet-vtable/issues/5 for which versions to use.
2. Run ./build-sqlite to fetch and build the SQLite dev bits
3. Run ./parquet/make to build the module
   - You will need to fix up the paths in this file to point at your local parquet-cpp folder.

Tests

Run:

tests/create-queries-from-templates
tests/test-all

Use

$ sqlite/sqlite3
sqlite> .load parquet/libparquet
sqlite> CREATE VIRTUAL TABLE demo USING parquet('parquet-generator/99-rows-1.parquet');
sqlite> SELECT * FROM demo;
...if all goes well, you'll see data here!...

Note: if you get an error like:

sqlite> .load parquet/libparquet
Error: parquet/libparquet.so: wrong ELF class: ELFCLASS64

You have the 32-bit SQLite installed. To fix this, do:

sudo apt-get remove --purge sqlite3
sudo apt-get install sqlite3:amd64
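
To confirm which side of the mismatch is 32-bit before reinstalling anything, you can inspect the ELF header directly. This is a minimal sketch, not part of this repository: the helper name `elf_class` is hypothetical, and it relies only on the standard ELF layout, where the fifth byte of the file (`EI_CLASS`) is 1 for 32-bit and 2 for 64-bit objects.

```python
def elf_class(header: bytes) -> str:
    """Classify an ELF file from its first five bytes."""
    if len(header) < 5 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # EI_CLASS lives at offset 4: 1 means 32-bit, 2 means 64-bit.
    return {1: "ELFCLASS32", 2: "ELFCLASS64"}.get(header[4], "unknown")


def elf_class_of_file(path: str) -> str:
    """Read just enough of the file at `path` to classify it."""
    with open(path, "rb") as handle:
        return elf_class(handle.read(5))


# Hand-built headers standing in for a 64-bit library and a 32-bit binary.
print(elf_class(b"\x7fELF\x02"))
print(elf_class(b"\x7fELF\x01"))
```

Running `elf_class_of_file` on both parquet/libparquet.so and your sqlite3 binary should show whether the two disagree.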

Supported features

Row group filtering

Row group filtering is supported for strings and numerics so long as the SQLite type matches the Parquet type.

e.g. if you have a column foo that is an INT32, this query will skip row groups whose statistics prove that they do not contain relevant rows:

SELECT * FROM tbl WHERE foo = 123;

but this query will devolve to a table scan:

SELECT * FROM tbl WHERE foo = '123';

This is laziness on my part and could be fixed without too much effort.
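
The idea can be sketched outside the extension. The following is an illustration only, not the module's actual code: `RowGroupStats` and `candidate_row_groups` are hypothetical names, only equality constraints are handled, and the type check is a simplified stand-in for the "SQLite type matches the Parquet type" rule.

```python
from dataclasses import dataclass
from typing import List, Union

Value = Union[int, float, str]


@dataclass
class RowGroupStats:
    """Hypothetical stand-in for one row group's column statistics."""
    min_value: Value
    max_value: Value


def candidate_row_groups(stats: List[RowGroupStats], needle: Value) -> List[int]:
    """Return indexes of row groups that might contain `needle`."""
    selected = []
    for index, group in enumerate(stats):
        # Statistics are only usable when the literal has the column's type;
        # otherwise give up on pruning and read every row group.
        if type(needle) is not type(group.min_value):
            return list(range(len(stats)))
        if group.min_value <= needle <= group.max_value:
            selected.append(index)
    return selected


foo_stats = [RowGroupStats(1, 100), RowGroupStats(101, 200), RowGroupStats(201, 300)]
print(candidate_row_groups(foo_stats, 123))    # integer literal: pruning applies
print(candidate_row_groups(foo_stats, "123"))  # string literal: full scan
```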

Row filtering

For common constraints, the row is checked to see if it satisfies the query's constraints before returning control to SQLite's virtual machine. This minimizes the number of allocations performed when many rows are filtered out by the user's criteria.
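
As a rough model of this, assume columnar data held in plain Python lists and constraints expressed as per-column predicates (both are assumptions for illustration; the function name `scan` is hypothetical). The constrained columns are tested first, and a row tuple is only built for rows that pass.

```python
from typing import Callable, Dict, Iterator, List, Tuple

Predicate = Callable[[object], bool]


def scan(columns: Dict[str, List[object]],
         constraints: Dict[str, Predicate]) -> Iterator[Tuple[object, ...]]:
    """Yield only the rows that satisfy every constraint."""
    names = list(columns)
    row_count = len(next(iter(columns.values()), []))
    for i in range(row_count):
        # Check the constrained columns before materializing anything.
        if all(test(columns[name][i]) for name, test in constraints.items()):
            # Only now is a row object allocated and handed to the caller.
            yield tuple(columns[name][i] for name in names)


data = {
    "city": ["Dawson Creek", "Victoria", "Dawson Creek"],
    "question_id": [1935, 1936, 1950],
}
wanted = {
    "city": lambda v: v == "Dawson Creek",
    "question_id": lambda v: 1935 <= v <= 1940,
}
print(list(scan(data, wanted)))
```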

Memoized slices

Individual clauses are mapped to the row groups they match.

e.g. judging by row group statistics alone, which store only minimum and maximum values, a clause like WHERE city = 'Dawson Creek' may appear to match 80% of row groups.

In reality, the value may be present in only one or two row groups.

This is recorded in a shadow table so future queries that contain that clause can read only the necessary row groups.
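
A minimal in-memory sketch of the same bookkeeping follows. It is not the extension's implementation: `ClauseCache` and `to_bitmap` are hypothetical names, a dict stands in for the shadow table, and bitmaps are plain integers with bit N set for row group N. Treating the stored estimate as a validity check on the cached entry is also an assumption.

```python
from typing import Dict, Iterable, Tuple


def to_bitmap(row_groups: Iterable[int]) -> int:
    """Encode row group indexes as an integer bitmap."""
    bitmap = 0
    for index in row_groups:
        bitmap |= 1 << index
    return bitmap


class ClauseCache:
    """Maps a clause to (estimated bitmap, observed bitmap)."""

    def __init__(self) -> None:
        self.entries: Dict[str, Tuple[int, int]] = {}

    def record(self, clause: str, estimate: int, actual: int) -> None:
        # Only store an entry when the statistics were misleading.
        if estimate != actual:
            self.entries[clause] = (estimate, actual)

    def row_groups_to_read(self, clause: str, estimate: int) -> int:
        cached = self.entries.get(clause)
        # Reuse the observed bitmap only if it was derived from this estimate.
        if cached is not None and cached[0] == estimate:
            return cached[1]
        return estimate


cache = ClauseCache()
clause = "city = 'Dawson Creek'"
estimate = to_bitmap([0, 1, 2, 3])  # statistics suggest 4 of 5 row groups
first_read = cache.row_groups_to_read(clause, estimate)
cache.record(clause, estimate, to_bitmap([2]))  # only row group 2 had matches
second_read = cache.row_groups_to_read(clause, estimate)
print(bin(first_read), bin(second_read))
```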

Types

These Parquet types are supported:

INT96 timestamps (exposed as milliseconds since the epoch)
INT8/INT16/INT32/INT64
UTF8 strings
BOOLEAN
FLOAT
DOUBLE
Variable- and fixed-length byte arrays
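
For the INT96 case, the conversion to milliseconds can be sketched as below. This assumes the commonly used INT96 timestamp layout (8 little-endian bytes of nanoseconds within the day, followed by 4 little-endian bytes holding the Julian day number); whether the extension performs exactly this arithmetic is an assumption, and `int96_to_epoch_millis` is a hypothetical name.

```python
import struct

JULIAN_DAY_OF_UNIX_EPOCH = 2440588  # Julian day number of 1970-01-01
MILLIS_PER_DAY = 86_400_000
NANOS_PER_MILLI = 1_000_000


def int96_to_epoch_millis(raw: bytes) -> int:
    """Convert a 12-byte INT96 timestamp to milliseconds since the epoch."""
    if len(raw) != 12:
        raise ValueError("INT96 values are exactly 12 bytes")
    # "<qi" reads a little-endian int64 followed by a little-endian int32.
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days_since_epoch = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    return days_since_epoch * MILLIS_PER_DAY + nanos_of_day // NANOS_PER_MILLI


# One day and 1.5 ms after the epoch; sub-millisecond precision is dropped.
sample = struct.pack("<qi", 1_500_000, JULIAN_DAY_OF_UNIX_EPOCH + 1)
print(int96_to_epoch_millis(sample))
```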

These are not currently supported:

UINT8/UINT16/UINT32/UINT64
DECIMAL