# sqlite-parquet-vtable

A SQLite [virtual table](https://sqlite.org/vtab.html) extension to expose Parquet files as SQL tables.

## Caveats

I'm not an experienced C/C++ programmer. This library is definitely not bombproof. It's good enough for my use case,
and may be good enough for yours, too.

* I don't use `sqlite3_malloc` and `sqlite3_free` for C++ objects
  * Maybe this doesn't matter, since portability isn't a goal
* The C -> C++ interop definitely leaks some C++ exceptions
  * Obvious cases like file not found and unsupported Parquet types are OK
  * Low memory conditions aren't handled gracefully.

## Building

1. Install [`parquet-cpp`](https://github.com/apache/parquet-cpp)
    1. Master appears to be broken for text row group stats; see https://github.com/cldellow/sqlite-parquet-vtable/issues/5 for which versions to use
2. Run `./build-sqlite` to fetch and build the SQLite dev bits
3. Run `./parquet/make` to build the module
    1. You will need to fix up the paths in this file to point at your local parquet-cpp folder.

## Tests

Run:

```
tests/create-queries-from-templates
tests/test-all
```

## Use

```
$ sqlite/sqlite3
sqlite> .load parquet/libparquet
sqlite> CREATE VIRTUAL TABLE demo USING parquet('parquet-generator/99-rows-1.parquet');
sqlite> SELECT * FROM demo;
...if all goes well, you'll see data here!...
```
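
The same steps can also be driven from a host language. This is a rough sketch, not something this repository ships. It assumes Python's standard `sqlite3` module was built with loadable-extension support and that its SQLite is compatible with the module built above; neither is guaranteed. The helper names `create_table_sql` and `open_parquet_table` are made up for the example.

```python
import sqlite3


def create_table_sql(table: str, parquet_path: str) -> str:
    """Build the CREATE VIRTUAL TABLE statement shown in the shell session."""
    return f"CREATE VIRTUAL TABLE {table} USING parquet('{parquet_path}');"


def open_parquet_table(extension_path: str, parquet_path: str,
                       table: str = "demo") -> sqlite3.Connection:
    """Load the extension and expose one Parquet file as a virtual table.

    `extension_path` is assumed to be the library built by `./parquet/make`,
    e.g. "parquet/libparquet".
    """
    conn = sqlite3.connect(":memory:")
    conn.enable_load_extension(True)    # extension loading is off by default
    conn.load_extension(extension_path)
    conn.enable_load_extension(False)   # switch it back off once loaded
    conn.execute(create_table_sql(table, parquet_path))
    return conn
```

Something like `open_parquet_table("parquet/libparquet", "parquet-generator/99-rows-1.parquet")` followed by `conn.execute("SELECT * FROM demo")` would then mirror the session above.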

## Supported features

### Row group filtering

Row group filtering is supported for strings and numerics so long as the SQLite
type matches the Parquet type.

For example, if you have a column `foo` that is an INT32, this query will skip row groups whose
statistics prove that they do not contain relevant rows:

```
SELECT * FROM tbl WHERE foo = 123;
```

But this query will devolve to a table scan, because the string `'123'` does not match the column's integer type:

```
SELECT * FROM tbl WHERE foo = '123';
```

This is laziness on my part and could be fixed without too much effort.
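
To make the pruning rule concrete, here is a small model of the decision in Python. It is only a sketch of the idea, not the extension's C++ code: the `RowGroupStats` class, the `may_contain` function and the sample min/max values are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Union

Value = Union[int, float, str]


@dataclass
class RowGroupStats:
    """Hypothetical min/max statistics for one column of one row group."""
    min_value: Value
    max_value: Value


def may_contain(stats: RowGroupStats, wanted: Value) -> bool:
    """Return False only when the statistics prove `wanted` is absent."""
    # A constraint whose type differs from the column's type cannot be
    # compared against the statistics, so the row group is never skipped.
    if type(wanted) is not type(stats.min_value):
        return True
    return stats.min_value <= wanted <= stats.max_value


# Three row groups of an integer column `foo` (made-up ranges).
row_groups = [RowGroupStats(1, 100), RowGroupStats(101, 200), RowGroupStats(201, 300)]

# WHERE foo = 123: only the second row group has to be read.
to_read_int = [i for i, rg in enumerate(row_groups) if may_contain(rg, 123)]

# WHERE foo = '123': the type mismatch defeats pruning, so every group is read.
to_read_str = [i for i, rg in enumerate(row_groups) if may_contain(rg, "123")]
```

Returning `True` on a type mismatch is the conservative choice: the query stays correct and only loses the speed-up.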

### Row filtering

For common constraints, each row is checked against the query's
constraints before control returns to SQLite's virtual machine. This minimizes
the number of allocations performed when many rows are filtered out by
the user's criteria.
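
The idea can be pictured as a cursor that keeps advancing until it finds a row that passes, so rejected rows never reach the caller. The Python below is a toy model under that assumption, not the extension's implementation. The `Constraint` tuple shape and the `matching_rows` name are hypothetical.

```python
import operator
from typing import Callable, Iterable, Iterator, List, Tuple

# Hypothetical constraint shape: (column index, comparison, constant).
Constraint = Tuple[int, Callable[[object, object], bool], object]


def matching_rows(rows: Iterable[tuple],
                  constraints: List[Constraint]) -> Iterator[tuple]:
    """Yield only the rows that satisfy every constraint."""
    for row in rows:
        # Rows that fail a constraint are dropped here, before the caller
        # (standing in for SQLite's virtual machine) ever sees them.
        if all(op(row[col], const) for col, op, const in constraints):
            yield row


rows = [(1, "a"), (123, "b"), (123, "c"), (7, "d")]

# Models: SELECT * FROM tbl WHERE foo = 123, with foo as column 0.
kept = list(matching_rows(rows, [(0, operator.eq, 123)]))
```

In this model the caller does work for two rows instead of four; the saving grows with the share of rows the constraints reject.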

### Types

These Parquet types are supported:

* INT96 timestamps (exposed as milliseconds since the epoch)
* INT8/INT16/INT32/INT64
* UTF8 strings
* BOOLEAN
* FLOAT
* DOUBLE
* Variable- and fixed-length byte arrays

These are not currently supported:

* UINT8/UINT16/UINT32/UINT64
* DECIMAL
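
For the INT96 timestamps listed under supported types, the conversion to milliseconds since the epoch can be sketched as follows. This assumes the conventional INT96 layout written by tools such as Impala and Spark: eight little-endian bytes of nanoseconds within the day, then four little-endian bytes holding the Julian day number. Whether the extension decodes values in exactly this way is an assumption, and `int96_to_epoch_millis` is a name chosen for the example.

```python
import struct

JULIAN_DAY_OF_UNIX_EPOCH = 2440588  # Julian day number of 1970-01-01
MILLIS_PER_DAY = 86_400_000
NANOS_PER_MILLI = 1_000_000


def int96_to_epoch_millis(raw: bytes) -> int:
    """Convert a 12-byte INT96 timestamp to milliseconds since the epoch."""
    # "<qi": little-endian 8-byte nanoseconds-of-day, then 4-byte Julian day.
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days_since_epoch = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    # Sub-millisecond precision is discarded by the integer division.
    return days_since_epoch * MILLIS_PER_DAY + nanos_of_day // NANOS_PER_MILLI


# One day and 1.5 seconds after the epoch.
sample = struct.pack("<qi", 1_500_000_000, JULIAN_DAY_OF_UNIX_EPOCH + 1)
millis = int96_to_epoch_millis(sample)
```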