Cache clauses -> row group mapping

Create a shadow table. For `stats`, it'd be `_stats_rowgroups`.

It contains three columns:

- the clause (e.g. `city = 'Dawson Creek'`)
- the initial estimate, as a bitmap of rowgroups based on stats
- the actual observed rowgroups, as a bitmap

This papers over poorly sorted parquet files, at the cost of some disk space. It makes interactive queries much more natural -- drilldown-style queries are much faster, as they can leverage work done by previous queries.

e.g. `SELECT * FROM stats WHERE city = 'Dawson Creek' and question_id >= 1935 and question_id <= 1940` takes ~584ms on first run, but 9ms on subsequent runs.

We only create entries when the estimates don't match the actual results.

Fixes #6
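The caching rule described above can be sketched in memory as follows. This is only an illustration, not the code in this commit: `RowGroupCache`, `CachedClause`, `record` and `lookup` are hypothetical names, the bitmaps are modelled as `std::vector<bool>` rather than whatever encoding the shadow table stores, and a `std::map` stands in for the `_stats_rowgroups` table. It assumes C++17 for `std::optional`.

```cpp
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// One bit per row group; true means the row group has to be read.
using RowGroupBitmap = std::vector<bool>;

// Hypothetical in-memory equivalent of one row in the shadow table.
struct CachedClause {
  RowGroupBitmap estimated;  // row groups selected from parquet statistics
  RowGroupBitmap actual;     // row groups that really contained matching rows
};

// Hypothetical stand-in for the shadow table, keyed by the clause text.
class RowGroupCache {
 public:
  // Stores an entry only when the stats-based estimate was wrong,
  // mirroring "we only create entries when the estimates don't match".
  // Returns true if an entry was written.
  bool record(const std::string& clause, RowGroupBitmap estimated,
              RowGroupBitmap actual) {
    if (estimated == actual) return false;  // estimate was exact: nothing to cache
    entries_[clause] = CachedClause{std::move(estimated), std::move(actual)};
    return true;
  }

  // Returns the observed row groups for a clause seen before, if any.
  std::optional<RowGroupBitmap> lookup(const std::string& clause) const {
    auto it = entries_.find(clause);
    if (it == entries_.end()) return std::nullopt;
    return it->second.actual;
  }

 private:
  std::map<std::string, CachedClause> entries_;
};
```

A later query that repeats a cached clause could then scan only the row groups set in the returned bitmap and fall back to the stats-based estimate when `lookup` returns nothing.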
2025-12-22 06:33:29 +00:00 · 2018-03-24 23:51:15 -04:00
parent d2c736f25a
commit d3ab5ff3e7
9 changed files with 397 additions and 63 deletions
--- a/parquet/parquet_table.cc
+++ b/parquet/parquet_table.cc
@@ -2,9 +2,7 @@

 #include "parquet/api/reader.h"

-ParquetTable::ParquetTable(std::string file) {
-  this->file = file;
-
+ParquetTable::ParquetTable(std::string file, std::string tableName): file(file), tableName(tableName) {
  std::unique_ptr<parquet::ParquetFileReader> reader = parquet::ParquetFileReader::OpenFile(file.data());
  metadata = reader->metadata();
 }
@@ -138,3 +136,6 @@ std::string ParquetTable::CreateStatement() {
 }

 std::shared_ptr<parquet::FileMetaData> ParquetTable::getMetadata() { return metadata; }
+
+const std::string& ParquetTable::getFile() { return file; }
+const std::string& ParquetTable::getTableName() { return tableName; }