# sqlite-parquet-vtable
A SQLite [virtual table](https://sqlite.org/vtab.html) extension to expose Parquet files as SQL tables.
## Caveats
I'm not an experienced C/C++ programmer. This library is definitely not bombproof. It's good enough for my use case,
and may be good enough for yours, too.
* I don't use `sqlite3_malloc` and `sqlite3_free` for C++ objects
  * Maybe this doesn't matter, since portability isn't a goal
* The C -> C++ interop definitely leaks some C++ exceptions
  * Obvious cases like file not found and unsupported Parquet types are OK
  * Low memory conditions aren't handled gracefully
## Building
1. Install [`parquet-cpp`](https://github.com/apache/parquet-cpp)
   1. Master appears to be broken for text row group stats; see https://github.com/cldellow/sqlite-parquet-vtable/issues/5 for which versions to use
2. Run `./build-sqlite` to fetch and build the SQLite dev bits
3. Run `./parquet/make` to build the module
   1. You will need to fix up the paths in `./parquet/make` to point at your local parquet-cpp folder.
## Tests
Run:
```
tests/create-queries-from-templates
tests/test-all
```
## Use
```
$ sqlite/sqlite3
sqlite> .load parquet/libparquet
sqlite> CREATE VIRTUAL TABLE demo USING parquet('parquet-generator/99-rows-1.parquet');
sqlite> SELECT * FROM demo;
...if all goes well, you'll see data here!...
```
## Supported features
### Row group filtering
Row group filtering is supported for strings and numerics so long as the SQLite
type matches the Parquet type.

For example, if you have a column `foo` that is an INT32, this query will skip row groups whose
statistics prove that they do not contain relevant rows:
```
SELECT * FROM tbl WHERE foo = 123;
```
but this query will devolve to a table scan:
```
SELECT * FROM tbl WHERE foo = '123';
```
This is laziness on my part and could be fixed without too much effort.
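
As a rough illustration of the idea (not this extension's actual code), skipping a row group for `foo = 123` comes down to comparing the value against the group's min/max statistics. The `RowGroupStats` struct and both function names below are hypothetical and exist only for this sketch; they are not part of this library or of parquet-cpp.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified view of one row group's statistics for an INT32
// column. Illustrative only; not taken from this library or parquet-cpp.
struct RowGroupStats {
  bool hasMinMax;    // false when the file has no statistics for the column
  std::int32_t min;  // smallest value stored in the row group
  std::int32_t max;  // largest value stored in the row group
};

// Returns true when the statistics prove that no row in the group can satisfy
// `column = value`, so the whole group can be skipped without decoding it.
bool canSkipForEquals(const RowGroupStats& stats, std::int32_t value) {
  if (!stats.hasMinMax) {
    return false;  // no proof available, so the group must be scanned
  }
  return value < stats.min || value > stats.max;
}

// Returns the indexes of the row groups that still need to be scanned.
std::vector<std::size_t> rowGroupsToScan(const std::vector<RowGroupStats>& groups,
                                         std::int32_t value) {
  std::vector<std::size_t> result;
  for (std::size_t i = 0; i < groups.size(); ++i) {
    if (!canSkipForEquals(groups[i], value)) {
      result.push_back(i);
    }
  }
  return result;
}
```

In this sketch a group without statistics is always scanned, which mirrors the fallback to a table scan when no proof is available.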
### Row filtering
For common constraints, the row is checked to see if it satisfies the query's
constraints before returning control to SQLite's virtual machine. This minimizes
the number of allocations performed when many rows are filtered out by
the user's criteria.
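
A minimal sketch of that early check, assuming a row of integer cells and a small set of comparison operators. `Op`, `Constraint` and `rowSatisfies` are hypothetical names for illustration; the extension's real data structures will differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical comparison operators; a real implementation would cover more.
enum class Op { Equal, GreaterThan, LessThan };

// Hypothetical constraint on a single column, e.g. `foo = 123`.
struct Constraint {
  std::size_t column;  // index of the constrained column in the row
  Op op;               // comparison to apply
  std::int64_t value;  // right-hand side of the comparison
};

// Returns true only if the row passes every constraint. A caller would run
// this before handing the row back to SQLite and move on to the next row
// when it returns false.
bool rowSatisfies(const std::vector<std::int64_t>& row,
                  const std::vector<Constraint>& constraints) {
  for (const Constraint& c : constraints) {
    const std::int64_t cell = row[c.column];
    bool ok = false;
    switch (c.op) {
      case Op::Equal:       ok = cell == c.value; break;
      case Op::GreaterThan: ok = cell > c.value;  break;
      case Op::LessThan:    ok = cell < c.value;  break;
    }
    if (!ok) {
      return false;  // reject as soon as one constraint fails
    }
  }
  return true;
}
```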
### Types
These Parquet types are supported:
* INT96 timestamps (exposed as milliseconds since the epoch)
* INT8/INT16/INT32/INT64
* UTF8 strings
* BOOLEAN
* FLOAT
* DOUBLE
* Variable- and fixed-length byte arrays

These are not currently supported:
* UINT8/UINT16/UINT32/UINT64
* DECIMAL
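
For the INT96 timestamps listed among the supported types, here is a sketch of how such a value can be turned into milliseconds since the epoch. It assumes the common layout used by Impala and Hive writers: 8 little-endian bytes of nanoseconds within the day, followed by 4 little-endian bytes holding the Julian day number. The function name `int96ToEpochMillis` is made up for this example, and the extension's own conversion may differ in detail.

```cpp
#include <cstdint>

// Julian day number of the Unix epoch, 1970-01-01.
constexpr std::int64_t kJulianDayOfUnixEpoch = 2440588;
constexpr std::int64_t kMillisPerDay = 86400000;
constexpr std::int64_t kNanosPerMilli = 1000000;

// Converts a 12-byte INT96 timestamp to milliseconds since the Unix epoch.
// Bytes are decoded explicitly so the result does not depend on the host's
// endianness. Sub-millisecond precision is truncated.
std::int64_t int96ToEpochMillis(const std::uint8_t bytes[12]) {
  std::uint64_t nanosOfDay = 0;
  for (int i = 7; i >= 0; --i) {
    nanosOfDay = (nanosOfDay << 8) | bytes[i];
  }
  std::uint32_t julianDay = 0;
  for (int i = 11; i >= 8; --i) {
    julianDay = (julianDay << 8) | bytes[i];
  }
  const std::int64_t days =
      static_cast<std::int64_t>(julianDay) - kJulianDayOfUnixEpoch;
  return days * kMillisPerDay +
         static_cast<std::int64_t>(nanosOfDay) / kNanosPerMilli;
}
```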