Hive on Tez is taking a very long time to run a query
I'm kind of new to Hive and Hadoop. I have a query that takes 10 minutes to complete.
The size of the data is 10 GB.
Statistics: Num rows: 4457541 Data size: 1854337449 Basic stats: COMPLETE Column stats: COMPLETE
The table is partitioned and bucketed.
How can I improve the query below?
select * from tbl1 where clmn='Abdul' and loc='IND' and TO_UNIX_TIMESTAMP(ts) > (UNIX_TIMESTAMP() - 5*60*60);
These are the settings we have already tried:
set hive.vectorized.execution.reduce.enabled=true;
set hive.tez.container.size=8192;
set hive.fetch.task.conversion=none;
set mapred.compress.map.output=true;
set mapred.output.compress=true;
Explain:

Plan not optimized by CBO.

Stage-0
   Fetch Operator
      limit:-1
      Stage-1
         Map 1
            File Output Operator [FS_2973]
               compressed:false
               Statistics: Num rows: 49528 Data size: 24516360 Basic stats: COMPLETE Column stats: COMPLETE
               table:{"input format:":"org.apache.hadoop.mapred.TextInputFormat","output format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}
               Select Operator [SEL_2972]
                  outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7"]
                  Statistics: Num rows: 49528 Data size: 24516360 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator [FIL_2971]
                     predicate:((section = 'xysaa') and (to_unix_timestamp(ts) > (unix_timestamp() - 18000))) (type: boolean)
                     Statistics: Num rows: 49528 Data size: 24516360 Basic stats: COMPLETE Column stats: COMPLETE
                     TableScan [TS_2970]
                        ACID table:true
                        alias:pp
                        Statistics: Num rows: 4457541 Data size: 1854337449 Basic stats: COMPLETE Column stats: COMPLETE
None of these parameters helped us get the query to finish in a shorter time.
hive mapreduce apache-tez
asked Nov 18 '18 at 19:51 – Varshini

1 Answer
According to the plan, the query runs in mappers only, and vectorization is not enabled. Try this:
set hive.vectorized.execution.enabled=true;
set hive.vectorized.execution.reduce.enabled=true;
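Vectorization requires a vectorizable storage format such as ORC, which this table uses, so it should take effect once enabled. If your Hive version is 2.3 or later, you can verify it in the plan; a sketch, reusing the query from the question:

-- EXPLAIN VECTORIZATION is available in Hive 2.3+ (HIVE-11394)
EXPLAIN VECTORIZATION ONLY
select * from tbl1 where clmn='Abdul' and loc='IND' and TO_UNIX_TIMESTAMP(ts) > (UNIX_TIMESTAMP() - 5*60*60);

Look for "Execution mode: vectorized" under Map 1 in the output.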
Tune mapper parallelism:
set tez.grouping.max-size=67108864;
set tez.grouping.min-size=32000000;
Play with these settings to increase the number of mappers running. With a 64 MB maximum split, a 10 GB table would be read by roughly 10 GB / 64 MB ≈ 160 mappers rather than a handful. Ideally, the query should run well without this setting:
set hive.tez.container.size=8192;
One more recommendation is to replace unix_timestamp() with unix_timestamp(current_timestamp). unix_timestamp() is non-deterministic and its value is not fixed for the scope of a query execution, which prevents proper optimization of queries; it has been deprecated since Hive 2.0 in favor of the CURRENT_TIMESTAMP constant. So the predicate becomes:

(unix_timestamp(current_timestamp) - 5*60*60)
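For illustration, the query from the question with the rewritten predicate (a sketch; table and column names are taken from the question):

select *
from tbl1
where clmn = 'Abdul'
  and loc = 'IND'
  -- deterministic variant: value is fixed for the whole query
  and to_unix_timestamp(ts) > (unix_timestamp(current_timestamp) - 5*60*60);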
Also, your files are very small: each partition is 200-500 MB with 12 files, which works out to roughly 20-50 MB per file. Fortunately the table is stored as ORC, so you can merge small files using the ALTER TABLE ... CONCATENATE command, for example:
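A minimal sketch, assuming the table name tbl1 and the partition column loc from the question:

-- Merge small ORC files within a partition; run once per partition.
-- Table and partition values are taken from the question and may differ.
ALTER TABLE tbl1 PARTITION (loc='IND') CONCATENATE;

Note the plan shows ACID table:true; if CONCATENATE is rejected on an ACID table, a major compaction (ALTER TABLE tbl1 PARTITION (loc='IND') COMPACT 'major';) achieves the same file merging.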
That said, 12 files is not a big deal, and you probably will not notice an improvement when querying a single partition.
See also this answer: https://stackoverflow.com/a/48487306/2700344

answered Nov 19 '18 at 19:12, edited Nov 21 '18 at 20:07 – leftjoin
No improvement, mate. Still the same. I was wondering why it takes 10 minutes to complete the query; data that is huge (in the TB range) also takes the same time to return results. How can I tune this query to be faster?
– Varshini, Nov 20 '18 at 16:35
@Varshini Please answer these questions: how many files are in a single partition, what are the file format and compression, and which column is the table partitioned by?
– leftjoin, Nov 20 '18 at 17:15
The partition column is loc, and the compression is ZLIB with the ORC file format. There are 40-50 partitions.
– Varshini, Nov 21 '18 at 18:24
@Varshini Please also check how many files there are per partition, and their sizes.
– leftjoin, Nov 21 '18 at 18:33
There are 12 files in almost all partitions. Each partition is 200-500 MB, and the total size of the table is 10 GB.
– Varshini, Nov 21 '18 at 19:25