Presto Tuning Notes – Hive Table Partitioning and Bucketing

Sources:

  • https://community.hortonworks.com/content/supportkb/49637/hive-bucketing-and-partitioning.html
  • https://prestodb.io/docs/current/connector/hive.html

To better understand how partitioning and bucketing works, please take a look at how data is stored in hive. Let’s say you have a table

  1. <code>CREATE TABLE mytable (
  2. name string,
  3. city string,
  4. employee_id int )
  5. PARTITIONED BY (year STRING, month STRING, day STRING)
  6. CLUSTERED BY (employee_id) INTO 256 BUCKETS

You insert some data into a partition for 2015-12-02. Hive will then store data in a directory hierarchy, such as:

  1. <code>/user/hive/warehouse/mytable/y=2015/m=12/d=02

As such, it is important to be careful when partitioning. As a general rule of thumb, when choosing a field for partitioning, the field should not have a high cardinality – the term ‘cardinality‘ refers to the number of possible values a field can have. For instance, if you have a ‘country’ field, the countries in the world are about 300, so cardinality would be ~300. For a field like ‘timestamp_ms’, which changes every millisecond, cardinality can be billions. The cardinality of the field relates to the number of directories that could be created on the file system. As an example, if you partition by employee_id and you have millions of employees, you may end up having millions of directories in your file system.

Clustering, aka bucketing, on the other hand, will result in a fixed number of files, since you specify the number of buckets. What hive will do is to take the field, calculate a hash and assign a record to that bucket.

FAQ What happens if you use e.g. 256 buckets and the field you’re bucketing on has a low cardinality (for instance, it’s a US state, so can be only 50 different values? You’ll have 50 buckets with data, and 206 buckets with no data.

Can partitions dramatically cut the amount of data that is being queried? In the example table, if you want to query only from a certain date forward, the partitioning by year/month/day is going to dramatically cut the amount of IO.

Can bucketing can speed up joins with other tables that have exactly the same bucketing? In the above example, if you’re joining two tables on the same employee_id, hive can do the join bucket by bucket (even better if they’re already sorted by employee_id since it’s going to do a mergesort which works in linear time).

So, bucketing works well when the field has high cardinality and data is evenly distributed among buckets. Partitioning works best when the cardinality of the partitioning field is not too high.

Also, you can partition on multiple fields, with an order (year/month/day is a good example), while you can bucket on only one field.

Presto Examples

The Hive connector supports querying and manipulating Hive tables and schemas (databases). While some uncommon operations will need to be performed using Hive directly, most operations can be performed using Presto.

Create a new Hive schema named web that will store tables in an S3 bucket named my-bucket:

Create a new Hive table named page_views in the web schema that is stored using the ORC file format, partitioned by date and country, and bucketed by user into 50 buckets (note that Hive requires the partition columns to be the last columns in the table):

Drop a partition from the page_views table:

Query the page_views table:

Create an external Hive table named request_logs that points at existing data in S3:

Drop the external table request_logs. This only drops the metadata for the table. The referenced data directory is not deleted:

Drop a schema:

This post was last modified on Tháng mười một 19, 2024 11:12 chiều