tensorflow - tf.data.Dataset randomly skip samples before batching to get different batches
My model consumes chronologically ordered sequences within each input batch, so I create batches before shuffling my input data. The problem is that batches then always contain the same data samples across the whole dataset: every batch starts at the same indices, shifted by batch_size (e.g. with batch_size = 4, the first batch always holds windows 0-3, the next windows 4-7, and so on). I worked around this by caching the initial dataset and sampling from skipped copies of it, but that eats up memory pretty fast (even though my dataset is only 150 MB):
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices(data)
# Cut the series into overlapping windows of length window_size.
dataset = dataset.window(size=window_size, shift=window_shift, stride=window_stride,
                         drop_remainder=True).flat_map(lambda x: x.batch(window_size))
dataset = dataset.map(process_fn, num_parallel_calls=8)
dataset = dataset.cache()
# One shifted view of the dataset per possible start offset inside a batch.
datasets = []
for i in range(batch_size):
    d = dataset.skip(i)
    d = d.batch(batch_size, drop_remainder=True)
    datasets.append(d)
dataset = tf.data.experimental.sample_from_datasets(datasets)
dataset = dataset.shuffle(buffer_size=30000, reshuffle_each_iteration=False)
dataset = dataset.repeat()
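A minimal sketch for sanity-checking that the batch start offsets actually vary, assuming TF 1.x graph mode (as in the rest of this question) and that process_fn keeps the window's leading time dimension:

check_iter = dataset.make_one_shot_iterator()
next_batch = check_iter.get_next()
with tf.Session() as sess:
    for _ in range(3):
        batch = sess.run(next_batch)
        # The first element of each window reveals where that window starts.
        print(batch[:, 0])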
Is there another way to achieve this behaviour? I want every possible index to be usable as the start of the first sequence inside a batch.
python tensorflow tensorflow-datasets tensorflow-estimator
asked Nov 23 '18 at 19:14 by Chocolate, edited Nov 28 '18 at 10:11
Did you find a better way to use less memory? – Cloud Cho, Mar 26 at 0:04
Sadly, I did not get around to refactoring this section of my code yet. I am not sure if there is a better way now. – Chocolate, Apr 3 at 21:25
I see you set buffer_size to 30,000. Have you tried a smaller number, like 3? Also, what are the dataset's dimensions and shape? – Cloud Cho, Apr 3 at 22:25
You should give more code, like process_fn and batch_size, so that we can reproduce the problem. – giser_yugang, Apr 4 at 2:11
batch_size is 128 – Chocolate, Apr 6 at 21:53
1 Answer
You are eating up memory because you are shuffling entire batches, and the skipping may not be very efficient either. Since your data seems to fit entirely in memory, you can sample it directly in Python without too much concern about performance:
import numpy as np
import tensorflow as tf

def make_batch(start_idx):
    # Row r of the batch is the window starting at start_idx + r * window_shift.
    batch = np.empty((batch_size, window_size), dtype=data.dtype)
    for batch_idx, data_idx in enumerate(
            range(start_idx, start_idx + window_shift * batch_size, window_shift)):
        batch[batch_idx] = data[data_idx:data_idx + window_size * window_stride:window_stride]
    return batch

# Number of start indices that still leave room for a full batch of full windows.
n_starts = len(data) - window_stride * (window_size - 1) - window_shift * (batch_size - 1)
dataset = (tf.data.Dataset
           .range(n_starts)
           .shuffle(buffer_size=30000, reshuffle_each_iteration=False)
           .map(lambda x: tf.py_func(make_batch, [x], tf.float32))  # assuming your data is float32
           .repeat()
           .prefetch(1))  # you might want to consider prefetching for performance
The shuffling now occurs on indices, not on entire batches, so it has a much lower memory footprint.
answered Apr 4 at 7:03 by P-Gn
Thanks for your help, I'll look into it. Anyway, I had hoped for a pure TensorFlow solution to this problem. – Chocolate, Apr 6 at 21:52
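For completeness, the same index-shuffling idea can be kept entirely inside the graph by replacing tf.py_func with tensor ops. A minimal, untested sketch; make_batch_tf and data_tensor are illustrative names not from the thread, and data is assumed to fit in memory as a rank-1 float tensor:

import tensorflow as tf

def make_batch_tf(start_idx):
    # Row r of the batch is the window starting at start_idx + r * window_shift.
    row_starts = start_idx + tf.range(batch_size, dtype=tf.int64) * window_shift
    # Column c within each row is offset by c * window_stride.
    col_offsets = tf.range(window_size, dtype=tf.int64) * window_stride
    indices = row_starts[:, None] + col_offsets[None, :]  # shape (batch_size, window_size)
    return tf.gather(data_tensor, indices)

n_starts = len(data) - window_stride * (window_size - 1) - window_shift * (batch_size - 1)
dataset = (tf.data.Dataset
           .range(n_starts)
           .shuffle(buffer_size=30000, reshuffle_each_iteration=False)
           .map(make_batch_tf)
           .repeat()
           .prefetch(1))

Like the py_func version, this shuffles only scalar indices and materializes each batch on demand with tf.gather; note that process_fn would still need to be applied afterwards, e.g. via another map.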