How to avoid a cross join in Hive?











I have two tables: one contains 1 million records, the other 20 million records.





table 1
value
(1, 1)
(2, 2)
(3, 3)
(4, 4)
(5, 4)
....

table 2
value
(55, 11)
(33, 22)
(44, 66)
(22, 11)
(11, 33)
....



I need to multiply the values in table 1 by the values in table 2, rank the results, and keep the top 5 for each row of table 1. The result would look like this:





value from table 1, top 5 for each value in table 1
(1, 1), 1*44 + 1*66 = 110
(1, 1), 1*55 + 1*11 = 66
(1, 1), 1*33 + 1*22 = 55
(1, 1), 1*11 + 1*33 = 44
(1, 1), 1*22 + 1*11 = 33
.....



I tried to use a cross join in Hive, but it always fails because the tables are too large.
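For illustration, this is roughly the kind of query involved (the table names table1, table2 and column names col1, col2 are assumptions; the actual query may differ):

select t1.col1, t1.col2,
       t1.col1 * t2.col1 + t1.col2 * t2.col2 as result
from table1 t1
cross join table2 t2;   -- 1 million x 20 million rows is what fails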










Tags: hive, cross-join






asked Nov 7 at 7:07 by vito yan
edited Nov 7 at 9:37
          2 Answers

















          Accepted answer
          Select the top 5 rows from table 2 first, then cross join the result with the first table. This gives the same result as cross joining the two tables and taking the top 5 afterwards, but the number of rows joined is much smaller. A cross join with a small 5-row dataset is transformed into a map join and executes about as fast as a full scan of table 1.



          Look at the demo below. The cross join was transformed into a map join. Note the "Map Join Operator" in the plan and this warning: "Warning: Map Join MAPJOIN[19][bigTable=?] in task 'Map 1' is a cross product":



          hive> set hive.cbo.enable=true;
          hive> set hive.compute.query.using.stats=true;
          hive> set hive.execution.engine=tez;
          hive> set hive.auto.convert.join.noconditionaltask=false;
          hive> set hive.auto.convert.join=true;
          hive> set hive.vectorized.execution.enabled=true;
          hive> set hive.vectorized.execution.reduce.enabled=true;
          hive> set hive.vectorized.execution.mapjoin.native.enabled=true;
          hive> set hive.vectorized.execution.mapjoin.native.fast.hashtable.enabled=true;
          hive>
          > explain
          > with table1 as (
          > select stack(5,1,2,3,4,5) as id
          > ),
          > table2 as
          > (select t2.id
          > from (select t2.id, dense_rank() over(order by id desc) rnk
          > from (select stack(11,55,33,44,22,11,1,2,3,4,5,6) as id) t2
          > )t2
          > where t2.rnk<6
          > )
          > select t1.id, t1.id*t2.id
          > from table1 t1
          > cross join table2 t2;
          Warning: Map Join MAPJOIN[19][bigTable=?] in task 'Map 1' is a cross product
          OK
          Plan not optimized by CBO.

          Vertex dependency in root stage
          Map 1 <- Reducer 3 (BROADCAST_EDGE)
          Reducer 3 <- Map 2 (SIMPLE_EDGE)

          Stage-0
          Fetch Operator
          limit:-1
          Stage-1
          Map 1
          File Output Operator [FS_17]
          compressed:false
          Statistics:Num rows: 1 Data size: 26 Basic stats: COMPLETE Column stats: NONE
          table:{"serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe","input format:":"org.apache.hadoop.mapred.TextInputFormat","output format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"}
          Select Operator [SEL_16]
          outputColumnNames:["_col0","_col1"]
          Statistics:Num rows: 1 Data size: 26 Basic stats: COMPLETE Column stats: NONE
          Map Join Operator [MAPJOIN_19]
          | condition map:[{"":"Inner Join 0 to 1"}]
          | HybridGraceHashJoin:true
          | keys:{}
          | outputColumnNames:["_col0","_col1"]
          | Statistics:Num rows: 1 Data size: 26 Basic stats: COMPLETE Column stats: NONE
          |<-Reducer 3 [BROADCAST_EDGE]
          | Reduce Output Operator [RS_14]
          | sort order:
          | Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: COMPLETE
          | value expressions:_col0 (type: int)
          | Select Operator [SEL_9]
          | outputColumnNames:["_col0"]
          | Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: COMPLETE
          | Filter Operator [FIL_18]
          | predicate:(dense_rank_window_0 < 6) (type: boolean)
          | Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: COMPLETE
          | PTF Operator [PTF_8]
          | Function definitions:[{"Input definition":{"type:":"WINDOWING"}},{"partition by:":"0","name:":"windowingtablefunction","order by:":"_col0(DESC)"}]
          | Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: COMPLETE
          | Select Operator [SEL_7]
          | | outputColumnNames:["_col0"]
          | | Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: COMPLETE
          | |<-Map 2 [SIMPLE_EDGE]
          | Reduce Output Operator [RS_6]
          | key expressions:0 (type: int), col0 (type: int)
          | Map-reduce partition columns:0 (type: int)
          | sort order:+-
          | Statistics:Num rows: 1 Data size: 48 Basic stats: COMPLETE Column stats: COMPLETE
          | UDTF Operator [UDTF_5]
          | function name:stack
          | Statistics:Num rows: 1 Data size: 48 Basic stats: COMPLETE Column stats: COMPLETE
          | Select Operator [SEL_4]
          | outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11"]
          | Statistics:Num rows: 1 Data size: 48 Basic stats: COMPLETE Column stats: COMPLETE
          | TableScan [TS_3]
          | alias:_dummy_table
          | Statistics:Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: COMPLETE
          |<-UDTF Operator [UDTF_2]
          function name:stack
          Statistics:Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: COMPLETE
          Select Operator [SEL_1]
          outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5"]
          Statistics:Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: COMPLETE
          TableScan [TS_0]
          alias:_dummy_table
          Statistics:Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: COMPLETE

          Time taken: 0.199 seconds, Fetched: 66 row(s)


          Just replace the stacks in my demo with your tables.
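          As a rough sketch, this is how the same approach might look with real tables instead of the stack() demo. The table and column names (table1, table2, col1, col2) are assumptions, and ranking table2 by col1 + col2 follows the suggestion in the comments below:

          with top5 as (                       -- keep only the 5 highest-ranked table2 rows
            select col1, col2
            from (
              select col1, col2,
                     dense_rank() over (order by col1 + col2 desc) as rnk
              from table2
            ) ranked
            where rnk <= 5
          )
          select t1.col1, t1.col2,
                 t1.col1 * t5.col1 + t1.col2 * t5.col2 as result
          from table1 t1
          cross join top5 t5;                  -- the 5-row side is broadcast as a map join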






          answered Nov 7 at 9:01 by leftjoin
          • Thank you very much. I am sorry that I did not describe my problem fully; the values in table 2 are not ordered. I will update my question, but your answer does resolve my original problem.
            – vito yan
            Nov 7 at 9:31










          • @vitoyan So, table2 has two columns, right? Then rank by the sum of these columns: 1*44 + 1*66 = 1*(44+66) = 110. Use the same dense_rank() over (order by t2.col1+t2.col2 desc) rnk.
            – leftjoin
            Nov 7 at 9:47










          • @vitoyan Please vote or accept if you are satisfied with my answer
            – leftjoin
            Nov 7 at 9:50












          • Based on my problem description, you provided a perfect solution. I will vote for and accept your answer and post a new question; I hope you can provide a perfect solution for that too. Thank you very much.
            – vito yan
            Nov 7 at 9:56


















          Please go through this link, it may help:
          https://mapr.com/support/s/article/Optimizing-Hive-cross-joins?language=en_US

          Hive supports map joins only for inner, left, and right outer joins. To make sure the product join happens as a map join, we may have to fake an inner join on a constant key. We also have to lower the split size, to say 5 MB, and write the query as below. The cross join of the two subqueries (Table1, Table2) will then be distributed across all 60 mappers.


          set mapreduce.input.fileinputformat.split.maxsize=5000000;



          with Table1 AS
            (Select A1, value, 1 as key from A),      -- A1 must be selected here so it can be grouped on below
          Table2 AS
            (Select value, 1 as key from B)
          Select Table1.A1,
                 min(Table1.value * Table2.value)
          from Table1 inner join Table2
            on (Table1.key = Table2.key)              -- constant key makes every row pair join
          group by Table1.A1;


          Kindly refer to the link for more details.
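          To tie this back to the original question, here is a minimal sketch of the same pattern (table and column names table1, table2, col1, col2 are assumptions). Note that, unlike the accepted answer, this still materializes the full 1M x 20M product and only relies on the small split size to spread the work across many mappers:

          set mapreduce.input.fileinputformat.split.maxsize=5000000;

          with a as (select col1, col2, 1 as k from table1),
               b as (select col1, col2, 1 as k from table2)
          select col1, col2, result
          from (
            select a.col1, a.col2,
                   a.col1 * b.col1 + a.col2 * b.col2 as result,
                   dense_rank() over (partition by a.col1, a.col2
                                      order by a.col1 * b.col1 + a.col2 * b.col2 desc) as rnk
            from a join b on a.k = b.k       -- equi-join on a constant key: a distributed cross product
          ) x
          where rnk <= 5;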





          answered Nov 7 at 9:49 by Debabrata
          • Actually, I have demonstrated that Hive transforms a cross join into a map join as well. The warning "Warning: Map Join MAPJOIN[19][bigTable=?] in task 'Map 1' is a cross product" says the same; also, please have a look at the map join operator in the plan.
            – leftjoin
            Nov 7 at 9:52












