Google BigQuery: Last modified datetime of a row

I am trying to measure the duration of a Dataflow pipeline that pulls messages from Pub/Sub and loads them into a BigQuery table. I cannot find a way to get the last modified time of a row in a BigQuery table, although the table itself has a last modified datetime.
Does anyone know how to set a last modified datetime on a row of a BigQuery table?

asked Nov 7 at 9:01

Kenta Kozuka

You could add it manually in your pipeline, or you could use Stackdriver Logging/Monitoring to get this information as well.
– Serge Hendrickx
Nov 7 at 9:18

But I cannot know when the Dataflow load job actually finishes inserting a row, even if I add a timestamp field or logging in the code (please correct me if I am wrong). I want to know the duration from the time a service publishes a message to the time that message becomes available in BigQuery.
– Kenta Kozuka
Nov 7 at 10:01


google-bigquery dataflow


1 Answer

You should include the current timestamp in the application that creates the output data structure. That would be the event time in some sense (you can add more granularity by recording event times on the client or on the server, depending on where your events originate).

Then you probably want to record the time before processing (right after the message is read from Pub/Sub), and again right before you write to BigQuery.

You can do both with a DoFn as an extra step, or as the first action of the first transform and the last action of the last transform in your pipeline.

Add the corresponding new columns to the schema of the output BigQuery table.
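
As a rough illustration of the stamping steps, here is a minimal sketch in plain Python. The field names (`published_at`, `read_at`, `before_write_at`) and the helpers `stamp` and `seconds_between` are hypothetical, not part of any SDK. In a Beam pipeline a function like `stamp("read_at")` could presumably be applied with `beam.Map` or wrapped in a DoFn; that wiring is left out so the example runs on its own.

```python
from datetime import datetime, timezone
from typing import Any, Callable, Dict


def utc_now_iso() -> str:
    """Current UTC time as an ISO 8601 string with an explicit offset."""
    return datetime.now(timezone.utc).isoformat()


def stamp(field: str, now: Callable[[], str] = utc_now_iso) -> Callable[[Dict[str, Any]], Dict[str, Any]]:
    """Build a per-row function that adds a timestamp under `field`.

    The input row is copied rather than mutated, so the function is safe
    to use as a per-element transform.
    """
    def _stamp(row: Dict[str, Any]) -> Dict[str, Any]:
        stamped = dict(row)
        stamped[field] = now()
        return stamped
    return _stamp


def seconds_between(row: Dict[str, Any], start_field: str, end_field: str) -> float:
    """Elapsed seconds between two ISO 8601 timestamp fields of a row."""
    start = datetime.fromisoformat(row[start_field])
    end = datetime.fromisoformat(row[end_field])
    return (end - start).total_seconds()


# Hypothetical message: `published_at` is assumed to be set by the publisher.
message = {"payload": "hello", "published_at": utc_now_iso()}
row = stamp("read_at")(message)       # right after reading from Pub/Sub
row = stamp("before_write_at")(row)   # right before writing to BigQuery
print(seconds_between(row, "published_at", "before_write_at"))
```

Note that `before_write_at` only captures the moment just before the write is issued, not the moment the row becomes queryable.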

answered Nov 7 at 15:11

mremes


Thanks for your answer. Following your suggestion, I can know the times at which the original Pub/Sub messages are generated, the Dataflow processing starts, and Dataflow starts loading them into BigQuery (the time before loading). Is it possible to know the time at which the data are loaded into BigQuery (the time just after loading)?
– Kenta Kozuka
Nov 8 at 2:14

It would also be nice if I could measure the duration of each load job.
– Kenta Kozuka
Nov 8 at 2:46

What you are looking for is essentially the table creation time and the job duration. The former can be found in your destination's meta-table, which you can fetch with a BigQueryIO.read() fromQuery() call or with a BigQuery client library (either Python or Java, depending on which Dataflow SDK you're using).
– Lefteris S
Nov 12 at 13:34

Job duration is part of the job information, and you could also get it by calling BigQuery directly (i.e. not via BigQueryIO). If you're using Python, there's a specific method to get the job details. If you're using Java, you might have to get it via a request to the REST API or with bq show -j run through a ProcessBuilder.
– Lefteris S
Nov 12 at 13:38
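
To make the job-duration idea concrete, here is a small sketch that derives timings from a job resource, i.e. the JSON you get back from the REST `jobs.get` call or from `bq show --format=prettyjson -j <job_id>`. It assumes the resource carries `statistics.creationTime`, `statistics.startTime` and `statistics.endTime` as millisecond epoch values; the helper name `job_timing` and the sample values are made up for illustration. It only applies when rows are written by load jobs, since streaming inserts do not create a job.

```python
from datetime import datetime, timezone
from typing import Any, Dict


def job_timing(job_resource: Dict[str, Any]) -> Dict[str, Any]:
    """Derive queue time, run time and end time from a BigQuery job resource.

    The statistics fields are expected as millisecond epoch values
    (they arrive as strings in the JSON representation).
    """
    stats = job_resource["statistics"]
    created_ms = int(stats["creationTime"])
    started_ms = int(stats["startTime"])
    ended_ms = int(stats["endTime"])
    return {
        "ended_at": datetime.fromtimestamp(ended_ms / 1000, tz=timezone.utc),
        "queued_seconds": (started_ms - created_ms) / 1000,
        "run_seconds": (ended_ms - started_ms) / 1000,
    }


# Sample resource with made-up values, shaped like the job JSON.
sample_job = {
    "statistics": {
        "creationTime": "1541581260000",
        "startTime": "1541581261000",
        "endTime": "1541581265500",
    }
}
timing = job_timing(sample_job)
print(timing["queued_seconds"], timing["run_seconds"], timing["ended_at"].isoformat())
```

Comparing `ended_at` with a publish timestamp stored on the row would give an upper bound on the publish-to-available latency for rows in that load job.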
