Google BigQuery: Last modified datetime of a row























I am trying to measure the duration of a Dataflow pipeline that pulls messages from Pub/Sub and loads them into a BigQuery table. The table itself has a last-modified datetime, but I cannot find a way to get the last-modified time of an individual row.
Does anyone know how to record a last-modified datetime for each row of a BigQuery table?































  • You could add it manually in your pipeline, or you could use Stackdriver Logging/Monitoring to get this information as well.
    – Serge Hendrickx
    Nov 7 at 9:18










  • But I cannot know when the Dataflow load job actually finishes inserting a row, even if I add a timestamp field or logging in the code (please correct me if I am wrong). I want to know the duration from the time a service publishes a message to the time that message becomes available in BigQuery.
    – Kenta Kozuka
    Nov 7 at 10:01






















google-bigquery dataflow















asked Nov 7 at 9:01









Kenta Kozuka

























1 Answer






























You should include the current timestamp in the application that creates the output data structure. In some sense, that is the event time (you can get finer granularity by recording event times on the client or on the server, depending on where your events originate).



Next, you may want to record the time before processing (right after the message is read from Pub/Sub), and then the time right before you write to BigQuery.



You can do both with a DoFn as an extra step, or record them as the first action of the first transform and the last action of the last transform in your pipeline.



Add these new columns to the schema of the output BigQuery table.
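
As a rough illustration of the stamping steps described above, here is a minimal sketch in plain Python. The function names and the column names (`read_ts`, `pre_write_ts`) are hypothetical and not part of any Dataflow or BigQuery API; in a real pipeline the `stamp` function would be called from a DoFn or a map step. Note that this only measures up to the moment just before the write, not the moment a row becomes queryable.

```python
from datetime import datetime, timezone


def _now_iso() -> str:
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


def stamp(row: dict, column: str, now=_now_iso) -> dict:
    """Return a copy of `row` with `column` set to the current time.

    `now` is injectable so the function can be tested with a fixed clock.
    """
    stamped = dict(row)
    stamped[column] = now()
    return stamped


def stage_seconds(row: dict, start_column: str, end_column: str) -> float:
    """Seconds elapsed between two timestamp columns written by `stamp`."""
    start = datetime.fromisoformat(row[start_column])
    end = datetime.fromisoformat(row[end_column])
    return (end - start).total_seconds()
```

With the Beam Python SDK, one way to wire this in would be something like `beam.Map(stamp, "read_ts")` right after the Pub/Sub read and `beam.Map(stamp, "pre_write_ts")` right before the BigQuery write, with both columns declared as TIMESTAMP in the table schema. The durations can then be computed per row, either with `stage_seconds` or in a query.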



























  • Thanks for your answer. Following your suggestion, I can know the times at which the original Pub/Sub messages are generated, the Dataflow processing starts, and Dataflow starts loading them into BigQuery (the time just before loading). Is it possible to know the time at which the data are loaded into BigQuery (the time just after loading)?
    – Kenta Kozuka
    Nov 8 at 2:14










  • And it would be nice if I could measure the duration of each load job.
    – Kenta Kozuka
    Nov 8 at 2:46










  • What you are looking for is essentially the table creation time and the job duration. The former can be found in your destination's meta-table, which you can fetch with a BigQueryIO.read().fromQuery() call or with a BigQuery client library (either Python or Java, depending on which Dataflow SDK you're using).
    – Lefteris S
    Nov 12 at 13:34










  • Job duration is part of the job information, and you could also get it by calling BigQuery directly (i.e. not via BigQueryIO). If you're using Python, there's a specific method to get the job details. If you're using Java, you might have to get it via a request over the REST API or with bq show -j via a ProcessBuilder.
    – Lefteris S
    Nov 12 at 13:38
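
To make the job-duration idea concrete, here is a small sketch that computes timings from a job resource in JSON form. It assumes the JSON has the shape of a BigQuery job resource, where `statistics.creationTime`, `statistics.startTime`, and `statistics.endTime` are strings holding milliseconds since the epoch; such JSON might be obtained from the REST API or, for example, from `bq --format=json show -j <job_id>`. The function name `job_timings` is hypothetical. This only helps when the pipeline writes through load jobs; rows written with streaming inserts do not produce a load job to inspect.

```python
import json


def job_timings(job_json: str) -> dict:
    """Compute queue time and run time in seconds from a job resource.

    Assumes the statistics fields are millisecond epoch values encoded
    as strings, and that the job has already finished.
    """
    stats = json.loads(job_json)["statistics"]
    created = int(stats["creationTime"])
    started = int(stats["startTime"])
    ended = int(stats["endTime"])
    return {
        "queued_seconds": (started - created) / 1000.0,
        "run_seconds": (ended - started) / 1000.0,
    }
```

A job that is still running has no `endTime` yet, so in that case this sketch raises a `KeyError` rather than guessing a value.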














































answered Nov 7 at 15:11









mremes































 
