tensorflow - tf.data.Dataset randomly skip samples before batching to get different batches





My model consumes chronologically ordered sequences within each input batch, so I create batches before shuffling my input data. The problem is that the batches then always contain the same samples across the whole dataset (each batch starts at the same indices, shifted by batch_size). I worked around this by caching the initial dataset and sampling from skipped copies of it, but this eats up memory very quickly (even though my dataset is only 150 MB):



dataset = tf.data.Dataset.from_tensor_slices(data)
dataset = dataset.window(size=window_size, shift=window_shift,
                         stride=window_stride, drop_remainder=True)
dataset = dataset.flat_map(lambda x: x.batch(window_size))
dataset = dataset.map(process_fn, num_parallel_calls=8)
dataset = dataset.cache()
datasets = []
for i in range(batch_size):
    d = dataset.skip(i)
    d = d.batch(batch_size, drop_remainder=True)
    datasets.append(d)
dataset = tf.data.experimental.sample_from_datasets(datasets)
dataset = dataset.shuffle(buffer_size=30000, reshuffle_each_iteration=False)
dataset = dataset.repeat()


Is there another way to achieve this behaviour? I want to cover all possible indices for the start of the first sequence inside a batch.
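To make the offset issue concrete, here is a toy example (hypothetical numbers, not my actual pipeline; TF 2.x eager style shown only for illustration):

import tensorflow as tf

# Toy illustration: with batch_size = 3, plain batching of windows 0..8
# always yields the same groups, while skip(i) shifts every group by i,
# so sampling over i = 0..batch_size-1 covers all batch start offsets.
toy = tf.data.Dataset.range(9)
for i in range(3):
    shifted = toy.skip(i).batch(3, drop_remainder=True)
    print(i, [b.numpy().tolist() for b in shifted])
# 0 [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
# 1 [[1, 2, 3], [4, 5, 6]]
# 2 [[2, 3, 4], [5, 6, 7]]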










python tensorflow tensorflow-datasets tensorflow-estimator






share|improve this question















share|improve this question













share|improve this question




share|improve this question








asked Nov 23 '18 at 19:14 by Chocolate · edited Nov 28 '18 at 10:11

  • Did you find a better way to use less memory?

    – Cloud Cho
    Mar 26 at 0:04













  • Sadly, I have not gotten around to refactoring this section of my code yet. I am not sure if there is a better way now.

    – Chocolate
    Apr 3 at 21:25











  • I see you set buffer_size to 30,000. Have you tried a smaller number, like 3? Also, what are the dataset's dimensions and shape?

    – Cloud Cho
    Apr 3 at 22:25











  • You should provide more of your code, such as process_fn and batch_size, so that we can reproduce the problem.

    – giser_yugang
    Apr 4 at 2:11











  • batch_size is 128

    – Chocolate
    Apr 6 at 21:53





















1 Answer














You are eating up memory because you are shuffling entire batches, and skipping may not be very efficient either. Since your data seems to fit entirely in memory, you could sample it directly in Python without much concern about performance:



def make_batch(start_idx):
    batch = np.empty((batch_size, window_size), dtype=data.dtype)
    for batch_idx, data_idx in enumerate(
            range(start_idx, start_idx + window_shift * batch_size, window_shift)):
        batch[batch_idx] = data[data_idx:data_idx + window_size * window_stride:window_stride]
    return batch

dataset = (tf.data.Dataset
           .range(len(data) - window_stride * (window_size - 1) - window_shift * (batch_size - 1))
           .shuffle(buffer_size=30000, reshuffle_each_iteration=False)
           .map(lambda x: tf.py_func(make_batch, [x], tf.float32))  # assuming your data is float32
           .repeat()
           .prefetch(1))  # you might want to consider prefetching for performance


The shuffling now occurs on indices, not on entire batches, so it has a much lower memory footprint.
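As a quick sanity check of the output shape, a minimal sketch in the TF 1.x graph style that matches the tf.py_func usage above (it assumes data, batch_size, and the window_* parameters are defined as before; the snippet is illustrative, not part of the pipeline itself):

# Smoke test: pull one batch and verify its shape.
it = dataset.make_one_shot_iterator()
batch = it.get_next()
with tf.Session() as sess:
    print(sess.run(batch).shape)  # expected: (batch_size, window_size)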






answered Apr 4 at 7:03 by P-Gn
  • Thanks for your help, I'll look into it. I had hoped for a pure TensorFlow solution to this problem, though.

    – Chocolate
    Apr 6 at 21:52
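For anyone who, like the asker, wants to stay entirely inside TensorFlow: the same index-based idea can be kept in-graph by replacing the Python batch builder with gather ops. A minimal sketch under stated assumptions (data fits in memory as a tensor, and process_fn is omitted; gather_batch and n_starts are illustrative names, not from the original answer):

import tensorflow as tf

data_t = tf.constant(data)  # the in-memory data from the question

# Valid start indices for the first sequence of a batch, mirroring the
# range() bound used in the answer above.
n_starts = (len(data) - window_stride * (window_size - 1)
            - window_shift * (batch_size - 1))

def gather_batch(start_idx):
    # Row r of the batch starts at start_idx + r * window_shift;
    # column c within a row is offset by c * window_stride.
    row_starts = start_idx + tf.range(batch_size, dtype=tf.int64) * window_shift
    offsets = tf.range(window_size, dtype=tf.int64) * window_stride
    idx = row_starts[:, None] + offsets[None, :]  # shape (batch_size, window_size)
    return tf.gather(data_t, idx)

dataset = (tf.data.Dataset
           .range(n_starts)
           .shuffle(buffer_size=30000, reshuffle_each_iteration=False)
           .map(gather_batch, num_parallel_calls=8)
           .repeat()
           .prefetch(1))

As with the py_func version, only start indices are shuffled and each batch is materialized on demand, so the memory footprint stays small.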













