Parquet



Parquet is an open source, column-oriented file format. It handles complex data in large volumes well, and it is known both for its performant data compression and for its support of a wide variety of encoding types.

Advantages of Parquet:

  • Columnar storage limits I/O operations to the data actually needed.
  • Columnar storage lets readers fetch only the specific columns they need (see the sketch after this list).
  • Columnar storage consumes less space.
  • Columnar storage compresses better because all values in a column share a type, which enables type-specific encoding.
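
Column projection is easy to see in code. Below is a minimal sketch using pyarrow (the Python Arrow bindings that parquet-tools itself builds on); the file name test.parquet and the values are chosen to match the usage examples later on this page.

                               import pyarrow as pa
                               import pyarrow.parquet as pq

                               # Build a small table and write it to Parquet.
                               table = pa.table({
                                   "one": [-1.0, float("nan"), 2.5],
                                   "two": ["foo", "bar", "baz"],
                                   "three": [True, False, True],
                               })
                               pq.write_table(table, "test.parquet")

                               # The columnar layout lets a reader fetch only the columns
                               # it needs; column "two" is never read from disk here.
                               subset = pq.read_table("test.parquet", columns=["one", "three"])
                               print(subset)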



Parquet vs. CSV:

  • Parquet stores the file schema in the file metadata. CSV files don't store a schema, so readers must either be supplied with one or infer it from the data. Supplying a schema is tedious, and inferring one is error-prone and expensive.
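
Because the schema lives in the file footer, it can be read without scanning any row data. A small sketch with pyarrow, assuming the test.parquet written above:

                               import pyarrow.parquet as pq

                               # Reads only the footer metadata; no row data is scanned.
                               schema = pq.read_schema("test.parquet")
                               print(schema)  # one: double, two: string, three: bool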

parquet-tools


parquet-tools is a pip-installable CLI built on Apache Arrow. It can display the content and schema of Parquet files on local disk or on Amazon S3. It is not compatible with the original parquet-tools.

Features


  • Read Parquet data (local file or file on S3)
  • Read Parquet metadata/schema (local file or file on S3)

Installation

                            
                               $ pip install parquet-tools
                            
                            

Usage

                            
                               $ parquet-tools --help
                                usage: parquet-tools [-h] {show,csv,inspect} ...

                                parquet CLI tools

                                positional arguments:
                                  {show,csv,inspect}
                                    show              Show human readable format. see `show -h`
                                    csv               Cat csv style. see `csv -h`
                                    inspect           Inspect parquet file. see `inspect -h`

                                optional arguments:
                                  -h, --help          show this help message and exit
                            
                            

Usage Examples


Show local parquet file
                            
                               $ parquet-tools show test.parquet
                                +-------+-------+---------+
                                |   one | two   | three   |
                                |-------+-------+---------|
                                |  -1   | foo   | True    |
                                | nan   | bar   | False   |
                                |   2.5 | baz   | True    |
                                +-------+-------+---------+
                            
                            

Show parquet file on S3

                            
                               $ parquet-tools show s3://bucket-name/prefix/*
                                +-------+-------+---------+
                                |   one | two   | three   |
                                |-------+-------+---------|
                                |  -1   | foo   | True    |
                                | nan   | bar   | False   |
                                |   2.5 | baz   | True    |
                                +-------+-------+---------+
                            
                            

Inspect parquet file schema

                            
                               $ parquet-tools inspect /path/to/parquet
                            
                            

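Because parquet-tools is built on Apache Arrow, the same kind of file metadata can be pulled out directly in Python. A sketch, assuming pyarrow and the test.parquet from earlier:

                               import pyarrow.parquet as pq

                               pf = pq.ParquetFile("test.parquet")

                               # File-level metadata: row count, row groups, schema.
                               print(pf.metadata.num_rows)        # 3
                               print(pf.metadata.num_row_groups)  # 1
                               print(pf.schema_arrow)

                               # Per-column details within a row group
                               # (sizes, encodings, statistics).
                               print(pf.metadata.row_group(0).column(0))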

Cat Parquet as CSV and transform it with csvq

                            
                               $ parquet-tools csv s3://bucket-name/test.parquet |csvq "select one, three where three"
                                +-------+-------+
                                |  one  | three |
                                +-------+-------+
                                | -1.0  | True  |
                                | 2.5   | True  |
                                +-------+-------+
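
The same projection and row filter can be expressed directly in pyarrow if csvq is not at hand; a sketch against the same file:

                               import pyarrow.parquet as pq

                               # Select "one" and "three", keeping rows where "three" is
                               # true, mirroring the csvq query above.
                               table = pq.read_table(
                                   "test.parquet",
                                   columns=["one", "three"],
                                   filters=[("three", "==", True)],
                               )
                               print(table.to_pydict())  # {'one': [-1.0, 2.5], 'three': [True, True]}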
                            
                            

Parquet is built from the ground up with complex nested data structures in mind, and uses the record shredding and assembly algorithm described in the Dremel paper. The Parquet project believes this approach is superior to the simple flattening of nested namespaces.
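
In practice that means list- and struct-typed columns are shredded into a columnar layout on write and reassembled on read. A small sketch with pyarrow (the column names here are made up for illustration):

                               import pyarrow as pa
                               import pyarrow.parquet as pq

                               # One list-typed column and one struct-typed column.
                               nested = pa.table({
                                   "tags": [["a", "b"], [], ["c"]],
                                   "point": [{"x": 1, "y": 2}, {"x": 3, "y": 4}, {"x": 5, "y": 6}],
                               })
                               pq.write_table(nested, "nested.parquet")

                               # The nested values come back with their structure intact.
                               print(pq.read_table("nested.parquet").to_pydict())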

Parquet is built to support very efficient compression and encoding schemes. Multiple projects have demonstrated the performance impact of applying the right compression and encoding scheme to the data. Parquet allows compression schemes to be specified on a per-column level, and is future-proofed to allow adding more encodings as they are invented and implemented.
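
Writer APIs expose this directly; pyarrow's write_table, for example, accepts either a single codec for the whole file or a mapping from column name to codec. A sketch (the codec choices are illustrative, not a recommendation):

                               import pyarrow as pa
                               import pyarrow.parquet as pq

                               table = pa.table({
                                   "one": [-1.0, float("nan"), 2.5],
                                   "two": ["foo", "bar", "baz"],
                                   "three": [True, False, True],
                               })

                               # Per-column codecs: gzip for the string column, snappy
                               # for the cheaper-to-decode numeric and boolean columns.
                               pq.write_table(
                                   table,
                                   "mixed_codecs.parquet",
                                   compression={"one": "snappy", "two": "gzip", "three": "snappy"},
                               )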

Parquet is built to be used by anyone. The Hadoop ecosystem is rich with data processing frameworks, and the Parquet project is not interested in playing favorites. An efficient, well-implemented columnar storage substrate should be useful to all frameworks without the cost of extensive and difficult-to-set-up dependencies.