
Parquet

Parquet is an open-source, binary, column-oriented data format designed for efficient storage and querying of large datasets.

Advantages

  • Well suited to storing big data.
  • Can be queried by AWS Athena when stored in AWS S3.
  • Compact on disk.
  • Data is fully typed.

Disadvantages

  • Little to no schema evolution (see Schemas and Data Types below).

Under the Hood

A Parquet file is not as simple as a CSV file. Where a CSV file has one header and many records, a Parquet file has:

  • File Metadata
    • Row Groups
      • Column Chunks
        • Pages

… where each plural indicates a one-to-many relationship: the file metadata (stored in the file footer) describes many row groups, each row group holds one column chunk per column, and each column chunk holds many pages.
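This structure can be inspected programmatically. Here is a minimal sketch, assuming the pyarrow Python library is available; example.parquet is a placeholder file name:

```python
import pyarrow.parquet as pq

# Open the file lazily; only the footer (file metadata) is read at this point.
parquet_file = pq.ParquetFile("example.parquet")  # placeholder path
metadata = parquet_file.metadata

print(f"{metadata.num_rows} rows, {metadata.num_columns} columns, "
      f"{metadata.num_row_groups} row groups")

# File metadata -> row groups -> column chunks (individual pages are not exposed here).
for rg_index in range(metadata.num_row_groups):
    row_group = metadata.row_group(rg_index)
    print(f"row group {rg_index}: {row_group.num_rows} rows, "
          f"{row_group.total_byte_size} bytes")
    for col_index in range(row_group.num_columns):
        column = row_group.column(col_index)
        print(f"  {column.path_in_schema}: compression={column.compression}, "
              f"encodings={column.encodings}")
```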

Encoding

Parquet supports several column encodings, such as dictionary encoding, run-length encoding (RLE), and bit packing, which shrink repetitive data before any compression is applied.
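As a small sketch (again assuming pyarrow; the toy table and file names are made up for illustration), dictionary encoding can be toggled per column when writing, and the encodings actually used show up in the column-chunk metadata:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Toy table: a low-cardinality string column benefits greatly from dictionary encoding.
table = pa.table({"country": ["DE", "FR", "DE", "DE", "FR"] * 1000,
                  "amount": list(range(5000))})

# use_dictionary accepts a bool or a list of column names; file names are placeholders.
pq.write_table(table, "dict.parquet", use_dictionary=["country"])
pq.write_table(table, "plain.parquet", use_dictionary=False)

for path in ("dict.parquet", "plain.parquet"):
    country_chunk = pq.ParquetFile(path).metadata.row_group(0).column(0)
    print(path, country_chunk.encodings)
```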

Compression

Parquet also supports compression: each column can be compressed with a codec such as Snappy, GZIP, LZO, or ZSTD.
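The benchmark below comes from an external source, but a rough sketch of how a similar size comparison could be produced with pyarrow looks like this (pyarrow does not ship an LZO writer, so ZSTD stands in as a third codec; file names and the toy data are placeholders):

```python
import os

import pyarrow as pa
import pyarrow.parquet as pq

# Toy data set; a comparison like this only becomes meaningful at real scale.
table = pa.table({"id": list(range(1_000_000)),
                  "label": ["even" if i % 2 == 0 else "odd" for i in range(1_000_000)]})

for codec in ("NONE", "SNAPPY", "GZIP", "ZSTD"):
    path = f"data_{codec.lower()}.parquet"      # placeholder file name
    pq.write_table(table, path, compression=codec)
    size_mib = os.path.getsize(path) / 1024 ** 2
    print(f"{codec:>6}: {size_mib:.1f} MiB")
```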

Below is a table where a set of data is stored in different file formats:

File Format | Compression | Query time (s) | Size (GB)
----------- | ----------- | -------------- | ---------
CSV         | None        | 2892.34        | 37.46
Parquet     | Snappy      | 28.95          | 4.83
Parquet     | GZIP        | 40.33          | 6.78
Parquet     | None        | 43.41          | 38.54
Parquet     | LZO         | 50.65          | 5.6

As you can see:

  • Parquet outperforms CSV on query time by a huge margin: even uncompressed Parquet answers the query in about 43 s versus roughly 2892 s for CSV.
  • Enabling compression improves both the file size and the query performance.

The codecs themselves involve trade-offs:

Category                | LZO     | GZIP     | Snappy
----------------------- | ------- | -------- | -------
Compression Size        | Medium  | Smallest | Medium
Compression Speed       | Fast    | Slow     | Fastest
Decompression Speed     | Fastest | Slow     | Fast
Frequency of Data Usage | Hot     | Cold     | Hot
Splittable              | Yes     | No       | Yes

Schemas and Data Types

Every Parquet file carries its schema in the file metadata, and every column has a well-defined data type.

There is one caveat, though: schemas are not easily changed, if they can be changed at all. Non-breaking changes (such as adding a column)¹ may be possible, but removing columns, renaming them, or changing their data types are breaking changes by definition and require regenerating the Parquet files (which can be immensely expensive when there is a lot of data).
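As an illustrative sketch (pyarrow again, with placeholder file names): even a non-breaking change such as adding a column means writing a new file, because an existing Parquet file is immutable:

```python
import pyarrow as pa
import pyarrow.parquet as pq

pq.write_table(pa.table({"user_id": [1, 2, 3]}), "v1.parquet")  # placeholder path

# "Adding a column" is non-breaking for readers of the new schema,
# but it still means materialising a brand-new file; v1.parquet itself never changes.
old = pq.read_table("v1.parquet")
new = old.append_column("signup_year", pa.array([2021, 2022, 2023]))
pq.write_table(new, "v2.parquet")

print(pq.read_schema("v1.parquet"))  # user_id: int64
print(pq.read_schema("v2.parquet"))  # user_id: int64, signup_year: int64
```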

References

  1. The video [Avro vs Parquet](https://youtu.be/UrWthx8T3UY?t=292) suggests, at 4:52, that Parquet only supports schema appends.