Parquet
Parquet is an open-source, binary, column-oriented data format designed for efficient storage and querying of large datasets.
Advantages
- Well suited to storing Big Data.
- Can be queried by AWS Athena when stored in AWS S3.
- Compact.
- Data is fully typed.
Disadvantages
- Little to no schema evolution (see the caveat under Schemas and Data Types below).
Under the Hood
A Parquet file is not as simple as a CSV file, for example. Where a CSV file has one header and many records, a Parquet file has:
- File Metadata
- Row Groups
  - Column Groups
    - Pages

… where each plural indicates a one-to-many relationship: a file holds many row groups, each row group holds a chunk of each column, and each column chunk is split into many pages.
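As a minimal sketch of this hierarchy, the following uses pyarrow (one common Parquet library, not mentioned above) to walk the row groups and column chunks recorded in a file's footer; `data.parquet` is a hypothetical example path.

```python
import pyarrow.parquet as pq

# Open an existing Parquet file ("data.parquet" is a hypothetical path).
pf = pq.ParquetFile("data.parquet")

# The file metadata lives in the footer and describes the whole file.
meta = pf.metadata
print(f"{meta.num_row_groups} row groups, "
      f"{meta.num_columns} columns, {meta.num_rows} rows")

# Walk the one-to-many hierarchy: file -> row groups -> column chunks.
for rg_index in range(meta.num_row_groups):
    row_group = meta.row_group(rg_index)
    for col_index in range(row_group.num_columns):
        chunk = row_group.column(col_index)
        print(chunk.path_in_schema, chunk.compression, chunk.encodings)
```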
Encoding
Parquet supports several column encodings, such as dictionary encoding, run-length encoding (RLE), and bit packing, which shrink the data before any compression is applied.
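As a small sketch (pyarrow again; the column name and file path are invented for illustration), dictionary encoding can be toggled per column when writing:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A repetitive string column is a good fit for dictionary encoding:
# each distinct value is stored once and rows reference it by index.
table = pa.table({"country": ["DE", "DE", "US", "DE", "US"] * 1000})

# use_dictionary is on by default in pyarrow; it also accepts a list
# of column names to encode selectively.
pq.write_table(table, "encoded.parquet", use_dictionary=["country"])
```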
Compression
Parquet supports compression codecs such as Snappy, GZIP, and LZO.
Below is a table comparing the same dataset stored in different file formats:
File Format | Compression | Query time (s) | Size (GB) |
---|---|---|---|
CSV | None | 2892.3 | 437.46 |
Parquet | Snappy | 28.9 | 54.83 |
Parquet | GZIP | 40.3 | 36.78 |
Parquet | None | 43.4 | 138.54 |
Parquet | LZO | 50.6 | 55.6 |
As you can see:
- Parquet outperforms CSV in every way.
- Enabling compression improves both file size and query performance.
Category | LZO | GZIP | Snappy |
---|---|---|---|
Compression Size | Medium | Smallest | Medium |
Compression Speed | Fast | Slow | Fastest |
Decompression Speed | Fastest | Slow | Fast |
Frequency of Data Usage | Hot | Cold | Hot |
Splittable | Yes | No | Yes |
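To run such a comparison yourself, here is a sketch (pyarrow; the file names are hypothetical, and pyarrow does not ship an LZO codec, so only uncompressed, Snappy, and GZIP are shown):

```python
import os
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"value": list(range(1_000_000))})

# Write the same table with different codecs and compare file sizes.
for codec in ["none", "snappy", "gzip"]:
    path = f"data_{codec}.parquet"
    pq.write_table(table, path, compression=codec)
    print(f"{codec}: {os.path.getsize(path):,} bytes")
```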
Schemas and Data Types
Parquet supports schemas and data types: every file carries its schema in the file metadata, and every column has a declared type.
There is one caveat, though: schemas are not easily changed, if they can be changed at all. It may be possible to make non-breaking changes (like adding a column)¹, or it may not. Removing columns, renaming them, or changing their data type is a breaking change by definition and requires regenerating the Parquet files (which can be immensely expensive if one has lots of data).
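As an illustration of the one non-breaking change the footnote mentions, here is a sketch (pyarrow; all file and column names are invented) that appends a nullable column to a schema and reads old and new files together, with the old file's missing column filled with nulls:

```python
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

# Version 1: an explicit, fully typed schema.
schema_v1 = pa.schema([
    pa.field("id", pa.int64()),
    pa.field("name", pa.string()),
])
pq.write_table(
    pa.table({"id": [1, 2], "name": ["a", "b"]}, schema=schema_v1),
    "v1.parquet",
)

# Version 2: a non-breaking change -- append a new, nullable column.
schema_v2 = schema_v1.append(pa.field("email", pa.string()))
pq.write_table(
    pa.table({"id": [3], "name": ["c"], "email": ["c@example.com"]},
             schema=schema_v2),
    "v2.parquet",
)

# Reading both files under the new schema fills the missing column in
# the old file with nulls; nothing needs to be regenerated.
merged = ds.dataset(["v1.parquet", "v2.parquet"], schema=schema_v2).to_table()
print(merged)
```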
References
1. The video [Avro vs Parquet](https://youtu.be/UrWthx8T3UY?t=292) suggests, at 4:52, that Parquet only supports schema appends. ↩︎