BigQuery: Atomically replace a date partition using DML

up vote
1
down vote

favorite

I often want to load one day's worth of data into a date-partitioned BigQuery table, replacing any data that's already there. I know how to do this for 'old-style' data partitioned tables (the ones that have a _PARTITIONTIME field) but don't know how to do this with the new-style date-partitioned tables (which use a normal date/timestamp column to specify the partitioning because they don't allow one to use the $ decorator.

Let's say I want to do this on my_table. With old-style date-partitioned tables, I accomplished this using a load job that utilized the $ decorator and the WRITE_TRUNCATE write disposition -- e.g., I'd set the destination table to be my_table$20181005.

However, I'm not sure how to perform the equivalent operation using a DML. I find myself performing separate DELETE and INSERT commands. This isn't great because it increases complexity, the number of queries, and the operation isn't atomic.

I want to know how to do this using the MERGE command to keep this all contained within a single, atomic operation. However I can't wrap my head around the MERGE command's syntax and haven't found an example for this use case. Does anyone know how this should be done?

The ideal answer would be a DML statement that selected all columns from source_table and inserted it into the 2018-10-05 date partition of my_table, deleting any existing data that was in my_table's 2018-10-05 date partition. We can assume that source_table and my_table have the same schemas, and that my_table is partitioned on the day column, which is of type DATE.

edited Nov 7 at 16:08

asked Nov 7 at 8:58

conradlee

3,59263562

add a comment |

up vote
1
down vote

favorite

edited Nov 7 at 16:08

asked Nov 7 at 8:58

conradlee

3,59263562

add a comment |

up vote
1
down vote

favorite

edited Nov 7 at 16:08

asked Nov 7 at 8:58

conradlee

3,59263562

sql google-bigquery

edited Nov 7 at 16:08

asked Nov 7 at 8:58

conradlee

3,59263562

edited Nov 7 at 16:08

asked Nov 7 at 8:58

conradlee

3,59263562

edited Nov 7 at 16:08

asked Nov 7 at 8:58

conradlee

3,59263562

asked Nov 7 at 8:58

conradlee

3,59263562

asked Nov 7 at 8:58

conradlee

3,59263562

add a comment |

1 Answer
1

active

oldest

votes

up vote
2
down vote

accepted

because they don't allow one to use the $ decorator

But they do--you can use table_name$YYYYMMDD when you load into column-based partitioned table as well. For example, I made a partitioned table:

$ bq query --use_legacy_sql=false "CREATE TABLE tmp_elliottb.PartitionedTable (x INT64, y NUMERIC, date DATE) PARTITION BY date"

Then I loaded into a specific partition:

$ echo "1,3.14,2018-11-07" > row.csv

$ bq "tmp_elliottb.PartitionedTable$20181107" ./row.csv

I tried to load into the wrong partition for the input data, and received an error:

$ echo "1,3.14,2018-11-07" > row.csv

$ bq "tmp_elliottb.PartitionedTable$20181105" ./row.csv

Some rows belong to different partitions rather than destination partition 20181105

I then replaced the data for the partition:

$ echo "2,0.11,2018-11-07" > row.csv

$ bq "tmp_elliottb.PartitionedTable$20181107" ./row.csv

Yes, you can use MERGE as a way of replacing data for a partitioned table's partition, but you can also use a load job.

answered Nov 7 at 18:18

Elliott Brossard

15.3k2933

Thanks! I went back and looked at why I thought this wasn't possible, and it was because of an error message I got from Airflow (which in this case doesn't accurately reflect the BigQuery API). This solution is much better than DML for my use-case because it handles the case where schemas are additively updated over time (and thus inconsistent), which I can't seem to handle with the DML because it requires me to explicitly list out all the columns I'm going to insert into a table.
– conradlee
Nov 8 at 10:11

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53186198%2fbigquery-atomically-replace-a-date-partition-using-dml%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

up vote
2
down vote

accepted

because they don't allow one to use the $ decorator

But they do--you can use table_name$YYYYMMDD when you load into column-based partitioned table as well. For example, I made a partitioned table:

$ bq query --use_legacy_sql=false "CREATE TABLE tmp_elliottb.PartitionedTable (x INT64, y NUMERIC, date DATE) PARTITION BY date"

Then I loaded into a specific partition:

$ echo "1,3.14,2018-11-07" > row.csv

$ bq "tmp_elliottb.PartitionedTable$20181107" ./row.csv

I tried to load into the wrong partition for the input data, and received an error:

$ echo "1,3.14,2018-11-07" > row.csv

$ bq "tmp_elliottb.PartitionedTable$20181105" ./row.csv

Some rows belong to different partitions rather than destination partition 20181105

I then replaced the data for the partition:

$ echo "2,0.11,2018-11-07" > row.csv

$ bq "tmp_elliottb.PartitionedTable$20181107" ./row.csv

Yes, you can use MERGE as a way of replacing data for a partitioned table's partition, but you can also use a load job.

answered Nov 7 at 18:18

Elliott Brossard

15.3k2933

Thanks! I went back and looked at why I thought this wasn't possible, and it was because of an error message I got from Airflow (which in this case doesn't accurately reflect the BigQuery API). This solution is much better than DML for my use-case because it handles the case where schemas are additively updated over time (and thus inconsistent), which I can't seem to handle with the DML because it requires me to explicitly list out all the columns I'm going to insert into a table.
– conradlee
Nov 8 at 10:11

add a comment |

up vote
2
down vote

accepted

because they don't allow one to use the $ decorator

But they do--you can use table_name$YYYYMMDD when you load into column-based partitioned table as well. For example, I made a partitioned table:

$ bq query --use_legacy_sql=false "CREATE TABLE tmp_elliottb.PartitionedTable (x INT64, y NUMERIC, date DATE) PARTITION BY date"

Then I loaded into a specific partition:

$ echo "1,3.14,2018-11-07" > row.csv

$ bq "tmp_elliottb.PartitionedTable$20181107" ./row.csv

I tried to load into the wrong partition for the input data, and received an error:

$ echo "1,3.14,2018-11-07" > row.csv

$ bq "tmp_elliottb.PartitionedTable$20181105" ./row.csv

Some rows belong to different partitions rather than destination partition 20181105

I then replaced the data for the partition:

$ echo "2,0.11,2018-11-07" > row.csv

$ bq "tmp_elliottb.PartitionedTable$20181107" ./row.csv

Yes, you can use MERGE as a way of replacing data for a partitioned table's partition, but you can also use a load job.

answered Nov 7 at 18:18

Elliott Brossard

15.3k2933

Thanks! I went back and looked at why I thought this wasn't possible, and it was because of an error message I got from Airflow (which in this case doesn't accurately reflect the BigQuery API). This solution is much better than DML for my use-case because it handles the case where schemas are additively updated over time (and thus inconsistent), which I can't seem to handle with the DML because it requires me to explicitly list out all the columns I'm going to insert into a table.
– conradlee
Nov 8 at 10:11

add a comment |

up vote
2
down vote

accepted

because they don't allow one to use the $ decorator

But they do--you can use table_name$YYYYMMDD when you load into column-based partitioned table as well. For example, I made a partitioned table:

$ bq query --use_legacy_sql=false "CREATE TABLE tmp_elliottb.PartitionedTable (x INT64, y NUMERIC, date DATE) PARTITION BY date"

Then I loaded into a specific partition:

$ echo "1,3.14,2018-11-07" > row.csv

$ bq "tmp_elliottb.PartitionedTable$20181107" ./row.csv

I tried to load into the wrong partition for the input data, and received an error:

$ echo "1,3.14,2018-11-07" > row.csv

$ bq "tmp_elliottb.PartitionedTable$20181105" ./row.csv

Some rows belong to different partitions rather than destination partition 20181105

I then replaced the data for the partition:

$ echo "2,0.11,2018-11-07" > row.csv

$ bq "tmp_elliottb.PartitionedTable$20181107" ./row.csv

Yes, you can use MERGE as a way of replacing data for a partitioned table's partition, but you can also use a load job.

answered Nov 7 at 18:18

Elliott Brossard

15.3k2933

because they don't allow one to use the $ decorator

But they do--you can use table_name$YYYYMMDD when you load into column-based partitioned table as well. For example, I made a partitioned table:

$ bq query --use_legacy_sql=false "CREATE TABLE tmp_elliottb.PartitionedTable (x INT64, y NUMERIC, date DATE) PARTITION BY date"

Then I loaded into a specific partition:

$ echo "1,3.14,2018-11-07" > row.csv

$ bq "tmp_elliottb.PartitionedTable$20181107" ./row.csv

I tried to load into the wrong partition for the input data, and received an error:

$ echo "1,3.14,2018-11-07" > row.csv

$ bq "tmp_elliottb.PartitionedTable$20181105" ./row.csv

Some rows belong to different partitions rather than destination partition 20181105

I then replaced the data for the partition:

$ echo "2,0.11,2018-11-07" > row.csv

$ bq "tmp_elliottb.PartitionedTable$20181107" ./row.csv

Yes, you can use MERGE as a way of replacing data for a partitioned table's partition, but you can also use a load job.

answered Nov 7 at 18:18

Elliott Brossard

15.3k2933

answered Nov 7 at 18:18

Elliott Brossard

15.3k2933

answered Nov 7 at 18:18

Elliott Brossard

15.3k2933

answered Nov 7 at 18:18

Elliott Brossard

15.3k2933

Thanks! I went back and looked at why I thought this wasn't possible, and it was because of an error message I got from Airflow (which in this case doesn't accurately reflect the BigQuery API). This solution is much better than DML for my use-case because it handles the case where schemas are additively updated over time (and thus inconsistent), which I can't seem to handle with the DML because it requires me to explicitly list out all the columns I'm going to insert into a table.
– conradlee
Nov 8 at 10:11

add a comment |

Thanks! I went back and looked at why I thought this wasn't possible, and it was because of an error message I got from Airflow (which in this case doesn't accurately reflect the BigQuery API). This solution is much better than DML for my use-case because it handles the case where schemas are additively updated over time (and thus inconsistent), which I can't seem to handle with the DML because it requires me to explicitly list out all the columns I'm going to insert into a table.
– conradlee
Nov 8 at 10:11

Thanks! I went back and looked at why I thought this wasn't possible, and it was because of an error message I got from Airflow (which in this case doesn't accurately reflect the BigQuery API). This solution is much better than DML for my use-case because it handles the case where schemas are additively updated over time (and thus inconsistent), which I can't seem to handle with the DML because it requires me to explicitly list out all the columns I'm going to insert into a table.
– conradlee
Nov 8 at 10:11

add a comment |

draft saved

draft discarded

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Wsrtjtyk