BigQuery: Atomically replace a date partition using DML











up vote
1
down vote

favorite












I often want to load one day's worth of data into a date-partitioned BigQuery table, replacing any data that's already there. I know how to do this for 'old-style' data partitioned tables (the ones that have a _PARTITIONTIME field) but don't know how to do this with the new-style date-partitioned tables (which use a normal date/timestamp column to specify the partitioning because they don't allow one to use the $ decorator.



Let's say I want to do this on my_table. With old-style date-partitioned tables, I accomplished this using a load job that utilized the $ decorator and the WRITE_TRUNCATE write disposition -- e.g., I'd set the destination table to be my_table$20181005.



However, I'm not sure how to perform the equivalent operation using a DML. I find myself performing separate DELETE and INSERT commands. This isn't great because it increases complexity, the number of queries, and the operation isn't atomic.



I want to know how to do this using the MERGE command to keep this all contained within a single, atomic operation. However I can't wrap my head around the MERGE command's syntax and haven't found an example for this use case. Does anyone know how this should be done?



The ideal answer would be a DML statement that selected all columns from source_table and inserted it into the 2018-10-05 date partition of my_table, deleting any existing data that was in my_table's 2018-10-05 date partition. We can assume that source_table and my_table have the same schemas, and that my_table is partitioned on the day column, which is of type DATE.










share|improve this question




























    up vote
    1
    down vote

    favorite












    I often want to load one day's worth of data into a date-partitioned BigQuery table, replacing any data that's already there. I know how to do this for 'old-style' data partitioned tables (the ones that have a _PARTITIONTIME field) but don't know how to do this with the new-style date-partitioned tables (which use a normal date/timestamp column to specify the partitioning because they don't allow one to use the $ decorator.



    Let's say I want to do this on my_table. With old-style date-partitioned tables, I accomplished this using a load job that utilized the $ decorator and the WRITE_TRUNCATE write disposition -- e.g., I'd set the destination table to be my_table$20181005.



    However, I'm not sure how to perform the equivalent operation using a DML. I find myself performing separate DELETE and INSERT commands. This isn't great because it increases complexity, the number of queries, and the operation isn't atomic.



    I want to know how to do this using the MERGE command to keep this all contained within a single, atomic operation. However I can't wrap my head around the MERGE command's syntax and haven't found an example for this use case. Does anyone know how this should be done?



    The ideal answer would be a DML statement that selected all columns from source_table and inserted it into the 2018-10-05 date partition of my_table, deleting any existing data that was in my_table's 2018-10-05 date partition. We can assume that source_table and my_table have the same schemas, and that my_table is partitioned on the day column, which is of type DATE.










    share|improve this question


























      up vote
      1
      down vote

      favorite









      up vote
      1
      down vote

      favorite











      I often want to load one day's worth of data into a date-partitioned BigQuery table, replacing any data that's already there. I know how to do this for 'old-style' data partitioned tables (the ones that have a _PARTITIONTIME field) but don't know how to do this with the new-style date-partitioned tables (which use a normal date/timestamp column to specify the partitioning because they don't allow one to use the $ decorator.



      Let's say I want to do this on my_table. With old-style date-partitioned tables, I accomplished this using a load job that utilized the $ decorator and the WRITE_TRUNCATE write disposition -- e.g., I'd set the destination table to be my_table$20181005.



      However, I'm not sure how to perform the equivalent operation using a DML. I find myself performing separate DELETE and INSERT commands. This isn't great because it increases complexity, the number of queries, and the operation isn't atomic.



      I want to know how to do this using the MERGE command to keep this all contained within a single, atomic operation. However I can't wrap my head around the MERGE command's syntax and haven't found an example for this use case. Does anyone know how this should be done?



      The ideal answer would be a DML statement that selected all columns from source_table and inserted it into the 2018-10-05 date partition of my_table, deleting any existing data that was in my_table's 2018-10-05 date partition. We can assume that source_table and my_table have the same schemas, and that my_table is partitioned on the day column, which is of type DATE.










      share|improve this question















      I often want to load one day's worth of data into a date-partitioned BigQuery table, replacing any data that's already there. I know how to do this for 'old-style' data partitioned tables (the ones that have a _PARTITIONTIME field) but don't know how to do this with the new-style date-partitioned tables (which use a normal date/timestamp column to specify the partitioning because they don't allow one to use the $ decorator.



      Let's say I want to do this on my_table. With old-style date-partitioned tables, I accomplished this using a load job that utilized the $ decorator and the WRITE_TRUNCATE write disposition -- e.g., I'd set the destination table to be my_table$20181005.



      However, I'm not sure how to perform the equivalent operation using a DML. I find myself performing separate DELETE and INSERT commands. This isn't great because it increases complexity, the number of queries, and the operation isn't atomic.



      I want to know how to do this using the MERGE command to keep this all contained within a single, atomic operation. However I can't wrap my head around the MERGE command's syntax and haven't found an example for this use case. Does anyone know how this should be done?



      The ideal answer would be a DML statement that selected all columns from source_table and inserted it into the 2018-10-05 date partition of my_table, deleting any existing data that was in my_table's 2018-10-05 date partition. We can assume that source_table and my_table have the same schemas, and that my_table is partitioned on the day column, which is of type DATE.







      sql google-bigquery






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Nov 7 at 16:08

























      asked Nov 7 at 8:58









      conradlee

      3,59263562




      3,59263562
























          1 Answer
          1






          active

          oldest

          votes

















          up vote
          2
          down vote



          accepted











          because they don't allow one to use the $ decorator




          But they do--you can use table_name$YYYYMMDD when you load into column-based partitioned table as well. For example, I made a partitioned table:



          $ bq query --use_legacy_sql=false "CREATE TABLE tmp_elliottb.PartitionedTable (x INT64, y NUMERIC, date DATE) PARTITION BY date"


          Then I loaded into a specific partition:



          $ echo "1,3.14,2018-11-07" > row.csv
          $ bq "tmp_elliottb.PartitionedTable$20181107" ./row.csv


          I tried to load into the wrong partition for the input data, and received an error:



          $ echo "1,3.14,2018-11-07" > row.csv
          $ bq "tmp_elliottb.PartitionedTable$20181105" ./row.csv
          Some rows belong to different partitions rather than destination partition 20181105


          I then replaced the data for the partition:



          $ echo "2,0.11,2018-11-07" > row.csv
          $ bq "tmp_elliottb.PartitionedTable$20181107" ./row.csv


          Yes, you can use MERGE as a way of replacing data for a partitioned table's partition, but you can also use a load job.






          share|improve this answer





















          • Thanks! I went back and looked at why I thought this wasn't possible, and it was because of an error message I got from Airflow (which in this case doesn't accurately reflect the BigQuery API). This solution is much better than DML for my use-case because it handles the case where schemas are additively updated over time (and thus inconsistent), which I can't seem to handle with the DML because it requires me to explicitly list out all the columns I'm going to insert into a table.
            – conradlee
            Nov 8 at 10:11











          Your Answer






          StackExchange.ifUsing("editor", function () {
          StackExchange.using("externalEditor", function () {
          StackExchange.using("snippets", function () {
          StackExchange.snippets.init();
          });
          });
          }, "code-snippets");

          StackExchange.ready(function() {
          var channelOptions = {
          tags: "".split(" "),
          id: "1"
          };
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function() {
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled) {
          StackExchange.using("snippets", function() {
          createEditor();
          });
          }
          else {
          createEditor();
          }
          });

          function createEditor() {
          StackExchange.prepareEditor({
          heartbeatType: 'answer',
          convertImagesToLinks: true,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: 10,
          bindNavPrevention: true,
          postfix: "",
          imageUploader: {
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          },
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          });


          }
          });














           

          draft saved


          draft discarded


















          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53186198%2fbigquery-atomically-replace-a-date-partition-using-dml%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown

























          1 Answer
          1






          active

          oldest

          votes








          1 Answer
          1






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes








          up vote
          2
          down vote



          accepted











          because they don't allow one to use the $ decorator




          But they do--you can use table_name$YYYYMMDD when you load into column-based partitioned table as well. For example, I made a partitioned table:



          $ bq query --use_legacy_sql=false "CREATE TABLE tmp_elliottb.PartitionedTable (x INT64, y NUMERIC, date DATE) PARTITION BY date"


          Then I loaded into a specific partition:



          $ echo "1,3.14,2018-11-07" > row.csv
          $ bq "tmp_elliottb.PartitionedTable$20181107" ./row.csv


          I tried to load into the wrong partition for the input data, and received an error:



          $ echo "1,3.14,2018-11-07" > row.csv
          $ bq "tmp_elliottb.PartitionedTable$20181105" ./row.csv
          Some rows belong to different partitions rather than destination partition 20181105


          I then replaced the data for the partition:



          $ echo "2,0.11,2018-11-07" > row.csv
          $ bq "tmp_elliottb.PartitionedTable$20181107" ./row.csv


          Yes, you can use MERGE as a way of replacing data for a partitioned table's partition, but you can also use a load job.






          share|improve this answer





















          • Thanks! I went back and looked at why I thought this wasn't possible, and it was because of an error message I got from Airflow (which in this case doesn't accurately reflect the BigQuery API). This solution is much better than DML for my use-case because it handles the case where schemas are additively updated over time (and thus inconsistent), which I can't seem to handle with the DML because it requires me to explicitly list out all the columns I'm going to insert into a table.
            – conradlee
            Nov 8 at 10:11















          up vote
          2
          down vote



          accepted











          because they don't allow one to use the $ decorator




          But they do--you can use table_name$YYYYMMDD when you load into column-based partitioned table as well. For example, I made a partitioned table:



          $ bq query --use_legacy_sql=false "CREATE TABLE tmp_elliottb.PartitionedTable (x INT64, y NUMERIC, date DATE) PARTITION BY date"


          Then I loaded into a specific partition:



          $ echo "1,3.14,2018-11-07" > row.csv
          $ bq "tmp_elliottb.PartitionedTable$20181107" ./row.csv


          I tried to load into the wrong partition for the input data, and received an error:



          $ echo "1,3.14,2018-11-07" > row.csv
          $ bq "tmp_elliottb.PartitionedTable$20181105" ./row.csv
          Some rows belong to different partitions rather than destination partition 20181105


          I then replaced the data for the partition:



          $ echo "2,0.11,2018-11-07" > row.csv
          $ bq "tmp_elliottb.PartitionedTable$20181107" ./row.csv


          Yes, you can use MERGE as a way of replacing data for a partitioned table's partition, but you can also use a load job.






          share|improve this answer





















          • Thanks! I went back and looked at why I thought this wasn't possible, and it was because of an error message I got from Airflow (which in this case doesn't accurately reflect the BigQuery API). This solution is much better than DML for my use-case because it handles the case where schemas are additively updated over time (and thus inconsistent), which I can't seem to handle with the DML because it requires me to explicitly list out all the columns I'm going to insert into a table.
            – conradlee
            Nov 8 at 10:11













          up vote
          2
          down vote



          accepted







          up vote
          2
          down vote



          accepted







          because they don't allow one to use the $ decorator




          But they do--you can use table_name$YYYYMMDD when you load into column-based partitioned table as well. For example, I made a partitioned table:



          $ bq query --use_legacy_sql=false "CREATE TABLE tmp_elliottb.PartitionedTable (x INT64, y NUMERIC, date DATE) PARTITION BY date"


          Then I loaded into a specific partition:



          $ echo "1,3.14,2018-11-07" > row.csv
          $ bq "tmp_elliottb.PartitionedTable$20181107" ./row.csv


          I tried to load into the wrong partition for the input data, and received an error:



          $ echo "1,3.14,2018-11-07" > row.csv
          $ bq "tmp_elliottb.PartitionedTable$20181105" ./row.csv
          Some rows belong to different partitions rather than destination partition 20181105


          I then replaced the data for the partition:



          $ echo "2,0.11,2018-11-07" > row.csv
          $ bq "tmp_elliottb.PartitionedTable$20181107" ./row.csv


          Yes, you can use MERGE as a way of replacing data for a partitioned table's partition, but you can also use a load job.






          share|improve this answer













          because they don't allow one to use the $ decorator




          But they do--you can use table_name$YYYYMMDD when you load into column-based partitioned table as well. For example, I made a partitioned table:



          $ bq query --use_legacy_sql=false "CREATE TABLE tmp_elliottb.PartitionedTable (x INT64, y NUMERIC, date DATE) PARTITION BY date"


          Then I loaded into a specific partition:



          $ echo "1,3.14,2018-11-07" > row.csv
          $ bq "tmp_elliottb.PartitionedTable$20181107" ./row.csv


          I tried to load into the wrong partition for the input data, and received an error:



          $ echo "1,3.14,2018-11-07" > row.csv
          $ bq "tmp_elliottb.PartitionedTable$20181105" ./row.csv
          Some rows belong to different partitions rather than destination partition 20181105


          I then replaced the data for the partition:



          $ echo "2,0.11,2018-11-07" > row.csv
          $ bq "tmp_elliottb.PartitionedTable$20181107" ./row.csv


          Yes, you can use MERGE as a way of replacing data for a partitioned table's partition, but you can also use a load job.







          share|improve this answer












          share|improve this answer



          share|improve this answer










          answered Nov 7 at 18:18









          Elliott Brossard

          15.3k2933




          15.3k2933












          • Thanks! I went back and looked at why I thought this wasn't possible, and it was because of an error message I got from Airflow (which in this case doesn't accurately reflect the BigQuery API). This solution is much better than DML for my use-case because it handles the case where schemas are additively updated over time (and thus inconsistent), which I can't seem to handle with the DML because it requires me to explicitly list out all the columns I'm going to insert into a table.
            – conradlee
            Nov 8 at 10:11


















          • Thanks! I went back and looked at why I thought this wasn't possible, and it was because of an error message I got from Airflow (which in this case doesn't accurately reflect the BigQuery API). This solution is much better than DML for my use-case because it handles the case where schemas are additively updated over time (and thus inconsistent), which I can't seem to handle with the DML because it requires me to explicitly list out all the columns I'm going to insert into a table.
            – conradlee
            Nov 8 at 10:11
















          Thanks! I went back and looked at why I thought this wasn't possible, and it was because of an error message I got from Airflow (which in this case doesn't accurately reflect the BigQuery API). This solution is much better than DML for my use-case because it handles the case where schemas are additively updated over time (and thus inconsistent), which I can't seem to handle with the DML because it requires me to explicitly list out all the columns I'm going to insert into a table.
          – conradlee
          Nov 8 at 10:11




          Thanks! I went back and looked at why I thought this wasn't possible, and it was because of an error message I got from Airflow (which in this case doesn't accurately reflect the BigQuery API). This solution is much better than DML for my use-case because it handles the case where schemas are additively updated over time (and thus inconsistent), which I can't seem to handle with the DML because it requires me to explicitly list out all the columns I'm going to insert into a table.
          – conradlee
          Nov 8 at 10:11


















           

          draft saved


          draft discarded



















































           


          draft saved


          draft discarded














          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53186198%2fbigquery-atomically-replace-a-date-partition-using-dml%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown





















































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown

































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown







          這個網誌中的熱門文章

          Tangent Lines Diagram Along Smooth Curve

          Yusuf al-Mu'taman ibn Hud

          Zucchini