tensorflow - tf.data.Dataset randomly skip samples before batching to get different batches





My model consumes chronologically ordered sequences within each input batch, so I create batches before shuffling my input data. The problem is that the batches then always contain the same samples across the whole dataset (each batch starts at the same indices, shifted by batch_size). I worked around this by caching the initial dataset and sampling from skipped copies of it, but this eats up memory very quickly (even though my dataset is only 150 MB):



dataset = tf.data.Dataset.from_tensor_slices(data)
dataset = dataset.window(size=window_size, shift=window_shift,
                         stride=window_stride, drop_remainder=True)
dataset = dataset.flat_map(lambda x: x.batch(window_size))
dataset = dataset.map(process_fn, num_parallel_calls=8)
dataset = dataset.cache()
datasets = []
for i in range(batch_size):
    d = dataset.skip(i)
    d = d.batch(batch_size, drop_remainder=True)
    datasets.append(d)
dataset = tf.data.experimental.sample_from_datasets(datasets)
dataset = dataset.shuffle(buffer_size=30000, reshuffle_each_iteration=False)
dataset = dataset.repeat()


Is there another way to achieve this behaviour? I want to cover all possible indices for the start of the first sequence inside a batch.
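To make the offset issue concrete, here is a toy example (hypothetical numbers, not my actual pipeline; TF 2.x eager style shown only for illustration):

import tensorflow as tf

# Toy illustration: with batch_size = 3, plain batching of windows 0..8
# always yields the same groups, while skip(i) shifts every group by i,
# so sampling over i = 0..batch_size-1 covers all batch start offsets.
toy = tf.data.Dataset.range(9)
for i in range(3):
    shifted = toy.skip(i).batch(3, drop_remainder=True)
    print(i, [b.numpy().tolist() for b in shifted])
# 0 [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
# 1 [[1, 2, 3], [4, 5, 6]]
# 2 [[2, 3, 4], [5, 6, 7]]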










python tensorflow tensorflow-datasets tensorflow-estimator






share|improve this question















share|improve this question













share|improve this question




share|improve this question








asked Nov 23 '18 at 19:14 by Chocolate · edited Nov 28 '18 at 10:11

  • Did you find a better way to use less memory?

    – Cloud Cho
    Mar 26 at 0:04













  • Sadly, I have not gotten around to refactoring this section of my code yet. I am not sure if there is a better way now.

    – Chocolate
    Apr 3 at 21:25











  • I see you set buffer_size to 30,000. Have you tried a smaller number, like 3? Also, what are the dataset's dimensions and shape?

    – Cloud Cho
    Apr 3 at 22:25











  • You should provide more of your code, such as process_fn and batch_size, so that we can reproduce the problem.

    – giser_yugang
    Apr 4 at 2:11











  • batch_size is 128

    – Chocolate
    Apr 6 at 21:53





















1 Answer














You are eating up memory because you are shuffling entire batches, and skipping may not be very efficient either. Since your data seems to fit entirely in memory, you could sample it directly in Python without much concern about performance:



def make_batch(start_idx):
    batch = np.empty((batch_size, window_size), dtype=data.dtype)
    for batch_idx, data_idx in enumerate(
            range(start_idx, start_idx + window_shift * batch_size, window_shift)):
        batch[batch_idx] = data[data_idx:data_idx + window_size * window_stride:window_stride]
    return batch

dataset = (tf.data.Dataset
           .range(len(data) - window_stride * (window_size - 1) - window_shift * (batch_size - 1))
           .shuffle(buffer_size=30000, reshuffle_each_iteration=False)
           .map(lambda x: tf.py_func(make_batch, [x], tf.float32))  # assuming your data is float32
           .repeat()
           .prefetch(1))  # you might want to consider prefetching for performance


The shuffling now occurs on indices, not on entire batches, so it has a much lower memory footprint.
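As a quick sanity check of the output shape, a minimal sketch in the TF 1.x graph style that matches the tf.py_func usage above (it assumes data, batch_size, and the window_* parameters are defined as before; the snippet is illustrative, not part of the pipeline itself):

# Smoke test: pull one batch and verify its shape.
it = dataset.make_one_shot_iterator()
batch = it.get_next()
with tf.Session() as sess:
    print(sess.run(batch).shape)  # expected: (batch_size, window_size)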






answered Apr 4 at 7:03 by P-Gn
  • Thanks for your help, I'll look into it. I had hoped for a pure TensorFlow solution to this problem, though.

    – Chocolate
    Apr 6 at 21:52
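For anyone who, like the asker, wants to stay entirely inside TensorFlow: the same index-based idea can be kept in-graph by replacing the Python batch builder with gather ops. A minimal sketch under stated assumptions (data fits in memory as a tensor, and process_fn is omitted; gather_batch and n_starts are illustrative names, not from the original answer):

import tensorflow as tf

data_t = tf.constant(data)  # the in-memory data from the question

# Valid start indices for the first sequence of a batch, mirroring the
# range() bound used in the answer above.
n_starts = (len(data) - window_stride * (window_size - 1)
            - window_shift * (batch_size - 1))

def gather_batch(start_idx):
    # Row r of the batch starts at start_idx + r * window_shift;
    # column c within a row is offset by c * window_stride.
    row_starts = start_idx + tf.range(batch_size, dtype=tf.int64) * window_shift
    offsets = tf.range(window_size, dtype=tf.int64) * window_stride
    idx = row_starts[:, None] + offsets[None, :]  # shape (batch_size, window_size)
    return tf.gather(data_t, idx)

dataset = (tf.data.Dataset
           .range(n_starts)
           .shuffle(buffer_size=30000, reshuffle_each_iteration=False)
           .map(gather_batch, num_parallel_calls=8)
           .repeat()
           .prefetch(1))

As with the py_func version, only start indices are shuffled and each batch is materialized on demand, so the memory footprint stays small.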













