Hive on Tez is taking a very long time to run a query
I'm kind of new to Hive and Hadoop. I have a query that takes 10 minutes to complete.
The size of the data is 10 GB.
Statistics: Num rows: 4457541 Data size: 1854337449 Basic stats: COMPLETE Column stats: COMPLETE
The table is partitioned and bucketed.
How can I improve the query below?
select * from tbl1 where clmn='Abdul' and loc='IND' and TO_UNIX_TIMESTAMP(ts) > (UNIX_TIMESTAMP() - 5*60*60);
These are the settings we have already tried:
set hive.vectorized.execution.reduce.enabled=true;
set hive.tez.container.size=8192;
set hive.fetch.task.conversion=none;
set mapred.compress.map.output=true;
set mapred.output.compress=true;
Explain:

Plan not optimized by CBO.

Stage-0
   Fetch Operator
      limit:-1
      Stage-1
         Map 1
            File Output Operator [FS_2973]
               compressed:false
               Statistics: Num rows: 49528 Data size: 24516360 Basic stats: COMPLETE Column stats: COMPLETE
               table:{"input format:":"org.apache.hadoop.mapred.TextInputFormat","output format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}
               Select Operator [SEL_2972]
                  outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7"]
                  Statistics: Num rows: 49528 Data size: 24516360 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator [FIL_2971]
                     predicate:((section = 'xysaa') and (to_unix_timestamp(ts) > (unix_timestamp() - 18000))) (type: boolean)
                     Statistics: Num rows: 49528 Data size: 24516360 Basic stats: COMPLETE Column stats: COMPLETE
                     TableScan [TS_2970]
                        ACID table:true
                        alias:pp
                        Statistics: Num rows: 4457541 Data size: 1854337449 Basic stats: COMPLETE Column stats: COMPLETE
None of these parameters helped us get the query to finish in a shorter time.
hive mapreduce apache-tez
asked Nov 18 '18 at 19:51 – Varshini

1 Answer
According to the plan, the query runs in mappers only, and vectorization is not enabled. Try this:
set hive.vectorized.execution.enabled=true;
set hive.vectorized.execution.reduce.enabled=true;
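Vectorization requires a vectorizable storage format such as ORC, which this table uses, so it should take effect once enabled. If your Hive version is 2.3 or later, you can verify it in the plan; a sketch, reusing the query from the question:

-- EXPLAIN VECTORIZATION is available in Hive 2.3+ (HIVE-11394)
EXPLAIN VECTORIZATION ONLY
select * from tbl1 where clmn='Abdul' and loc='IND' and TO_UNIX_TIMESTAMP(ts) > (UNIX_TIMESTAMP() - 5*60*60);

Look for "Execution mode: vectorized" under Map 1 in the output.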
Tune mapper parallelism:
set tez.grouping.max-size=67108864;
set tez.grouping.min-size=32000000;
Play with these settings to increase the number of mappers running. With a 64 MB maximum split, a 10 GB table would be read by roughly 10 GB / 64 MB ≈ 160 mappers rather than a handful. Ideally, the query should run well without this setting:
set hive.tez.container.size=8192;
One more recommendation is to replace unix_timestamp() with unix_timestamp(current_timestamp). unix_timestamp() is non-deterministic and its value is not fixed for the scope of a query execution, which prevents proper optimization of queries; it has been deprecated since Hive 2.0 in favor of the CURRENT_TIMESTAMP constant. So the predicate becomes:

(unix_timestamp(current_timestamp) - 5*60*60)
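For illustration, the query from the question with the rewritten predicate (a sketch; table and column names are taken from the question):

select *
from tbl1
where clmn = 'Abdul'
  and loc = 'IND'
  -- deterministic variant: value is fixed for the whole query
  and to_unix_timestamp(ts) > (unix_timestamp(current_timestamp) - 5*60*60);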
Also, your files are very small: each partition is 200-500 MB with 12 files, which works out to roughly 20-50 MB per file. Fortunately the table is stored as ORC, so you can merge small files using the ALTER TABLE ... CONCATENATE command, for example:
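A minimal sketch, assuming the table name tbl1 and the partition column loc from the question:

-- Merge small ORC files within a partition; run once per partition.
-- Table and partition values are taken from the question and may differ.
ALTER TABLE tbl1 PARTITION (loc='IND') CONCATENATE;

Note the plan shows ACID table:true; if CONCATENATE is rejected on an ACID table, a major compaction (ALTER TABLE tbl1 PARTITION (loc='IND') COMPACT 'major';) achieves the same file merging.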
That said, 12 files is not a big deal, and you probably will not notice an improvement when querying a single partition.
See also this answer: https://stackoverflow.com/a/48487306/2700344

answered Nov 19 '18 at 19:12, edited Nov 21 '18 at 20:07 – leftjoin
No improvement, mate. Still the same. I was wondering why it takes 10 minutes to complete the query; data that is huge (in the TB range) also takes the same time to return results. How can I tune this query to be faster?
– Varshini, Nov 20 '18 at 16:35
@Varshini Please answer these questions: how many files are in a single partition, what are the file format and compression, and which column is the table partitioned by?
– leftjoin, Nov 20 '18 at 17:15
The partition column is loc, and the compression is ZLIB with the ORC file format. There are 40-50 partitions.
– Varshini, Nov 21 '18 at 18:24
@Varshini Please also check how many files there are per partition, and their sizes.
– leftjoin, Nov 21 '18 at 18:33
There are 12 files in almost all partitions. Each partition is 200-500 MB, and the total size of the table is 10 GB.
– Varshini, Nov 21 '18 at 19:25