Google BigQuery: Last modified datetime of a row
I am trying to measure the duration of a Dataflow pipeline that pulls messages from Pub/Sub and loads them into a BigQuery table. BigQuery exposes a last-modified datetime for the table, but I cannot find a way to get the last-modified time of an individual row.
Does anyone know how to set a last-modified datetime on a row of a BigQuery table?
You could add it manually in your pipeline, or you could use Stackdriver logging/monitoring to get this information as well.
– Serge Hendrickx
Nov 7 at 9:18
But even if I add a timestamp field or logging in the code, I cannot know when the Dataflow load job actually finishes inserting a row (please correct me if I am wrong). I want to know the duration from the time a service publishes a message to the time that message becomes available in BigQuery.
– Kenta Kozuka
Nov 7 at 10:01
asked Nov 7 at 9:01
Kenta Kozuka
1 Answer
You should include the current timestamp in the application that creates the output data structure. That would be the event time in some sense (you can add more granularity by recording event times on the client or on the server, depending on how your events originate).
Then you may want to record the time before processing (right after the message is read from Pub/Sub), and again right before you write to BigQuery.
You can do both with a DoFn as an extra step, or do them as the first action in the first transformation and the last action in the last transformation of your pipeline.
Add the corresponding new columns to the table schema of the output BigQuery table.
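As a rough illustration of the stamping steps described above, here is a minimal sketch in plain Python. It is not tied to any particular pipeline: the helper names `stamp` and `latency_seconds` and the column names `read_at` and `before_write_at` are hypothetical, and rows are assumed to be plain dictionaries.

```python
from datetime import datetime, timezone


def stamp(row, column, now=None):
    """Return a copy of `row` with an ISO-8601 UTC timestamp stored under `column`.

    `now` can be passed explicitly (useful for tests); otherwise the current
    UTC time is used.
    """
    now = now or datetime.now(timezone.utc)
    stamped = dict(row)  # copy so the input row is not mutated
    stamped[column] = now.isoformat()
    return stamped


def latency_seconds(row, start_column, end_column):
    """Return the number of seconds between two stamped columns of `row`."""
    start = datetime.fromisoformat(row[start_column])
    end = datetime.fromisoformat(row[end_column])
    return (end - start).total_seconds()


# Hypothetical column names: stamp once after reading, once before writing.
row = stamp({"id": 1}, "read_at")
row = stamp(row, "before_write_at")
print(latency_seconds(row, "read_at", "before_write_at"))
```

In a Beam Python pipeline, a function like this could presumably be applied as a step with something like `beam.Map(stamp, "read_at")`, since `beam.Map` forwards extra arguments to the callable. Note that this only captures the time just before the write, not the time at which the row becomes visible in BigQuery.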
Thanks for your answer. Following your suggestion, I can know when the original Pub/Sub messages are generated, when Dataflow processing starts, and when Dataflow starts loading them into BigQuery (the time just before loading). Is it possible to know when the data has been loaded into BigQuery (the time just after loading)?
– Kenta Kozuka
Nov 8 at 2:14
And it would be nice if I could measure the duration of each load job.
– Kenta Kozuka
Nov 8 at 2:46
What you are looking for is essentially the table creation time and the job duration. The former can be found in your destination's meta-table, which you can fetch with a BigQueryIO.read() fromQuery() call or with a BigQuery client library (either Python or Java, depending on which Dataflow SDK you're using).
– Lefteris S
Nov 12 at 13:34
Job duration is part of the job information, and you can also get it by calling BigQuery directly (i.e. not via BigQueryIO). If you're using Python, there's a specific method to get the job details. If you're using Java, you might have to get it via a request to the REST API or by running bq show -j with a ProcessBuilder.
– Lefteris S
Nov 12 at 13:38
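To make the job-duration idea concrete, here is a small sketch in Python. It assumes a job object that exposes `started` and `ended` datetimes; as far as I know, the objects returned by `Client.get_job()` in the google-cloud-bigquery library have properties with those names, but the example uses a stand-in object so it runs without credentials. The helper name `job_duration` is hypothetical.

```python
from datetime import datetime, timezone
from types import SimpleNamespace


def job_duration(job):
    """Return `job.ended - job.started` as a timedelta, or None if unfinished."""
    if job.started is None or job.ended is None:
        return None
    return job.ended - job.started


# Stand-in for a real job object (assumption: the real one has the same
# `started` / `ended` attributes).
fake_job = SimpleNamespace(
    started=datetime(2018, 11, 12, 13, 30, 0, tzinfo=timezone.utc),
    ended=datetime(2018, 11, 12, 13, 30, 42, tzinfo=timezone.utc),
)
print(job_duration(fake_job).total_seconds())
```

With a real client, the job would presumably come from something like `bigquery.Client().get_job(job_id)`, where `job_id` is the ID of the load job you want to inspect.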
answered Nov 7 at 15:11
mremes