For the statscan census set filtering on `== 'Dawson Creek'`, the query goes from 980ms to 660ms. A modest gain like this is expected, since the data isn't sorted by that column. I'll try adding some scaffolding to do filtering at the row level, too. We could also try unpacking the dictionary and testing the individual values, although we may want some heuristics to decide whether it's worth doing -- e.g. if < 10% of the rows have a unique value. Ideally, this should be a ~1ms query.
sqlite-parquet-vtable
A SQLite virtual table extension to expose Parquet files as SQL tables.
Caveats
I'm not an experienced C/C++ programmer. This library is definitely not bombproof. It's good enough for my use case, and may be good enough for yours, too.
- I don't use `sqlite3_malloc` and `sqlite3_free` for C++ objects
  - Maybe this doesn't matter, since portability isn't a goal
- The C -> C++ interop definitely leaks some C++ exceptions
  - Obvious cases like file not found and unsupported Parquet types are OK
  - Low memory conditions aren't handled gracefully.
Building
- Install `parquet-cpp`
- Run `./build-sqlite` to fetch and build the SQLite dev bits
- Run `./parquet/make` to build the module
  - You will need to fix up the paths in this file to point at your local parquet-cpp folder.
Use
$ sqlite/sqlite3
sqlite> .load parquet/libparquet
sqlite> CREATE VIRTUAL TABLE demo USING parquet('parquet-generator/100-rows-1.parquet');
sqlite> SELECT * FROM demo;
...if all goes well, you'll see data here!...
Supported features
Index
Only full table scans are supported.
Types
These types are supported:
- INT96 timestamps (exposed as milliseconds since the epoch)
- INT8/INT16/INT32/INT64
- UTF8 strings
- BOOLEAN
- FLOAT
- DOUBLE
- Variable- and fixed-length byte arrays
These are not supported:
- UINT8/UINT16/UINT32/UINT64
- DECIMAL
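
Since INT96 timestamps are exposed as milliseconds since the epoch, they usually need converting before they read as dates. The sketch below is one way to do that with SQLite's built-in `datetime` function. It is only an illustration: it uses a plain in-memory table as a stand-in for a Parquet-backed virtual table, and the column name `created_at` is hypothetical.

```python
import sqlite3

# Stand-in for a Parquet-backed virtual table: an ordinary in-memory table
# with a hypothetical `created_at` column holding milliseconds since the epoch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (created_at INTEGER)")
conn.execute("INSERT INTO demo VALUES (?)", (1_500_000_000_000,))

# SQLite's 'unixepoch' modifier expects seconds, so divide by 1000 first.
row = conn.execute(
    "SELECT datetime(created_at / 1000, 'unixepoch') FROM demo"
).fetchone()

print(row[0])  # 2017-07-14 02:40:00 (UTC)
```

The same `datetime(col / 1000, 'unixepoch')` expression should work unchanged in the `sqlite3` shell against a table created with `USING parquet(...)`, assuming the column really is an INT96 timestamp.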