Apache Parquet is a columnar storage format commonly used in the Hadoop ecosystem. If you work in the big data space, you probably work with Parquet files. Unlike common text-based storage formats such as CSV and JSON, Parquet files can't be quickly previewed and inspected with everyday tools. I often had to write Spark or Python code just to do very simple debugging.
To solve this problem, I created a CLI tool aptly named parquet-cli (the command is parq). It is released on PyPI and can be conveniently installed using pip:
pip install parquet-cli
Initial features
It currently supports a basic but very useful feature set for working with Parquet files:
- view file metadata
- get schema information
- get total count of rows in a file
- get top N records (head)
- get bottom N records (tail)
It only works with a single file as of now. However, I am planning to add support for directories. That means you will be able to give it the path to a partitioned directory and parq should work the same way as it does for a single file.
I wanted this tool to be very easy to install, so I deliberately kept the dependencies to a minimum. For example, I really like click, but it is a third-party dependency, so I decided to use the built-in argparse library for CLI parsing. The only hard dependencies are Apache Arrow (for reading Parquet files) and pandas (for manipulating them). Both are part of the Python data stack and are well maintained.
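For illustration, a command-line interface for this kind of tool needs nothing beyond the standard library. The sketch below uses a hypothetical program name (pqpeek) and hypothetical option names; it is not meant to document parq's real interface.

```python
import argparse


def build_parser():
    """Build an argparse parser for a hypothetical Parquet preview tool."""
    parser = argparse.ArgumentParser(
        prog="pqpeek",  # hypothetical name, used only in this example
        description="Preview a Parquet file from the command line.",
    )
    parser.add_argument("file", help="path to a single Parquet file")
    parser.add_argument("--schema", action="store_true", help="print the schema")
    parser.add_argument("--count", action="store_true", help="print the total row count")
    parser.add_argument("--head", type=int, metavar="N", help="print the first N rows")
    parser.add_argument("--tail", type=int, metavar="N", help="print the last N rows")
    return parser


# Parse an explicit argument list instead of sys.argv so the example is repeatable.
args = build_parser().parse_args(["data.parquet", "--head", "5"])
print(args.file, args.head)
```

Options that are not passed keep their defaults (None for --head and --tail, False for the flags), so the calling code can check which action was requested.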
This initial feature set covers what I need myself. If you have any suggestions or find any bugs, you can open a ticket on GitHub. Needless to say, any code contribution is very welcome too.
Posted on Utopian.io - Rewarding Open Source Contributors
Hi, your contribution was rejected because I found out that this already exists, so I thought maybe your project is a little bit redundant. The only difference currently seems to be the --tail command. Since I don't know anything about Parquet, I asked others about it and they agreed it wasn't unique enough to be accepted. If you continue working on this project, please highlight the reasons why it's unique in future contributions.
Also, have you ever heard of Click? It's made by Armin Ronacher, the same guy who made Flask, and it's amazing! I'd definitely recommend using that over argparse when creating a CLI.
Need help? Write a ticket on https://support.utopian.io.
Chat with us on Discord.
[utopian-moderator]
Fair enough. A colleague sent me that link after I released this tool. I understand where you are coming from.
I think it's more of a convenience thing. I don't think anyone using the Python data stack would want to install and build a large Java codebase just to check the contents of a file.