A SQLite vtable extension to read Parquet files
Go to file
Colin Dellow 6648ff5968 add string == row group filter
For the statscan census set filtering on `== 'Dawson Creek'`, the query
goes from 980ms to 660ms.

This is expected, since the data isn't sorted by that column.

I'll try adding some scaffolding to do filtering at the row level, too.

We could also try unpacking the dictionary and testing the individual
values, although we may want some heuristics to decide whether it's
worth doing -- eg if < 10% of the rows have a unique value.

Ideally, this should be like a ~1ms query.
2018-03-15 20:40:21 -04:00
parquet add string == row group filter 2018-03-15 20:40:21 -04:00
parquet-generator Fix when last rowgroup is not same size as first 2018-03-11 15:15:27 -04:00
tests row group skipping for is [not] null queries 2018-03-12 21:09:00 -04:00
.gitignore test-queries: can debug a testcase 2018-03-10 11:54:36 -05:00
LICENSE Initial commit 2018-03-02 18:37:08 -05:00
README.md Support BLOBs 2018-03-04 17:20:59 -05:00
build-sqlite Add script to fetch+build sqlite 2018-03-02 18:46:40 -05:00

README.md

sqlite-parquet-vtable

A SQLite virtual table extension to expose Parquet files as SQL tables.

Caveats

I'm not an experienced C/C++ programmer. This library is definitely not bombproof. It's good enough for my use case, and may be good enough for yours, too.

  • I don't use sqlite3_malloc and sqlite3_free for C++ objects
    • Maybe this doesn't matter, since portability isn't a goal
  • The C -> C++ interop definitely leaks some C++ exceptions
    • Obvious cases like file not found and unsupported Parquet types are OK
    • Low memory conditions aren't handled gracefully.

Building

  1. Install parquet-cpp
  2. Run ./build-sqlite to fetch and build the SQLite dev bits
  3. Run ./parquet/make to build the module
  4. You will need to fixup the paths in this file to point at your local parquet-cpp folder.

Use

$ sqlite/sqlite3
sqlite> .load parquet/libparquet
sqlite> CREATE VIRTUAL TABLE demo USING parquet('parquet-generator/100-rows-1.parquet');
sqlite> SELECT * FROM demo;
...if all goes well, you'll see data here!...

Supported features

Index

Only full table scans are supported.

Types

These types are supported:

  • INT96 timestamps (exposed as milliseconds since the epoch)
  • INT8/INT16/INT32/INT64
  • UTF8 strings
  • BOOLEAN
  • FLOAT
  • DOUBLE
  • Variable- and fixed-length byte arrays

These are not supported:

  • UINT8/UINT16/UINT32/UINT64
  • DECIMAL