Why is the result empty after merging two streams in Spark 2.4?



























I use Spark 2.4 in order to join two streams. The problem is that the result is empty.



I load streaming data from folders:



data/1



[
{"id1": 77,"name1": "aaa","timestamp": 1532609003},
{"id1": 77,"name1": "xxx","timestamp": 1532609005},
{"id1": 78,"name1": "xxx","timestamp": 1532609005}
]


data/2



[
{"id2": 77,"name2": "yyy", "timestamp2": 1532609000}
]


My code:



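from pyspark.sql import SparkSession
from pyspark.sql import functions as func
from pyspark.sql.types import (
    IntegerType, StringType, StructField, StructType, TimestampType
)

# Assumed: the original snippet relied on an already existing `spark` session.
spark = SparkSession.builder.getOrCreate()

schema1 = StructType([
    StructField("id1", IntegerType()),
    StructField("name1", StringType()),
    StructField("timestamp1", TimestampType())
])

schema2 = StructType([
    StructField("id2", IntegerType()),
    StructField("name2", StringType()),
    StructField("timestamp2", TimestampType())
])

# First stream: JSON files from data/1, watermarked on timestamp1
ds1 = (
    spark
    .readStream
    .format("json")
    .schema(schema1)
    .load("data/1")
    .withWatermark("timestamp1", "2 minutes")
)

# Second stream: JSON files from data/2, watermarked on timestamp2
ds2 = (
    spark
    .readStream
    .format("json")
    .schema(schema2)
    .load("data/2")
    .withWatermark("timestamp2", "2 minutes")
)

# Left outer join on id, with timestamp1 within 2 minutes after timestamp2
ds_joined = ds1.join(
    ds2,
    func.expr("""
        id1 = id2 AND
        timestamp1 >= timestamp2 AND
        timestamp1 <= timestamp2 + interval 2 minutes
    """),
    "leftOuter"
).fillna(0)

query = (
    ds_joined
    .writeStream
    .format("console")
    .start()
)

query.awaitTermination()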


As can be seen, I used a watermark of 2 minutes, so I do not understand why I get an empty joined dataset.



Expected output:



id1  id2  name1  name2  timestamp1  timestamp2
77   77   aaa    yyy    1532609003  1532609000
77   77   xxx    yyy    1532609005  1532609000
78   0    xxx    0      1532609005  0
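
For reference, the join condition can be checked against the sample rows in plain Python, outside Spark. This is only a sketch under a few assumptions: the timestamps are treated as epoch seconds, the first stream's time field is named `timestamp1` (the sample file calls it `timestamp`), and the 0 placeholders for unmatched rows follow the expected-output table rather than any guarantee about what `fillna(0)` does. The helper names `matches` and `left_outer_join` are made up for this illustration, and the sketch says nothing about when a streaming query emits rows.

```python
# Sample rows from the question; timestamps are assumed to be epoch seconds.
left = [
    {"id1": 77, "name1": "aaa", "timestamp1": 1532609003},
    {"id1": 77, "name1": "xxx", "timestamp1": 1532609005},
    {"id1": 78, "name1": "xxx", "timestamp1": 1532609005},
]
right = [{"id2": 77, "name2": "yyy", "timestamp2": 1532609000}]

WINDOW_SECONDS = 2 * 60  # "interval 2 minutes" from the join expression


def matches(l, r):
    """Same predicate as the SQL join expression in the question."""
    return (
        l["id1"] == r["id2"]
        and r["timestamp2"] <= l["timestamp1"] <= r["timestamp2"] + WINDOW_SECONDS
    )


def left_outer_join(left_rows, right_rows):
    """Naive batch left outer join; unmatched right-hand columns become 0."""
    result = []
    for l in left_rows:
        hits = [r for r in right_rows if matches(l, r)]
        if not hits:
            # Placeholder row, mirroring the expected-output table.
            hits = [{"id2": 0, "name2": 0, "timestamp2": 0}]
        for r in hits:
            result.append({**l, **r})
    return result


joined = left_outer_join(left, right)
for row in joined:
    print(row["id1"], row["id2"], row["name1"], row["name2"],
          row["timestamp1"], row["timestamp2"])
```

On this data the predicate itself selects the two `id1 = 77` rows and leaves `id1 = 78` unmatched, so the join expression is consistent with the expected table when evaluated as a batch.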
































  • Ignoring syntax errors, which I assume are not present in your original code: a) the input is not valid JSONL, so you should use the multiLine option; b) the schema doesn't match the data, as an integer is not valid input for a timestamp; c) even if you treated the values as epoch timestamps and cast them directly (StructField("timestamp1", "long") and col("timestamp1").cast("timestamp")), the latest value is from July or so, so it would be discarded with a 2-minute watermark. There might be some other problem as well, not obvious at first glance.

    – user6910411
    Nov 24 '18 at 21:31
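
Points a) and c) of the comment can be checked without Spark. The sketch below uses only the Python standard library. The string `data1` is a copy of the `data/1` sample, `parses_as_json_lines` is a hypothetical helper written for this illustration, and reading the values as epoch seconds in UTC is an assumption.

```python
import json
from datetime import datetime, timezone

# Copy of the data/1 sample from the question.
data1 = """[
{"id1": 77,"name1": "aaa","timestamp": 1532609003},
{"id1": 77,"name1": "xxx","timestamp": 1532609005},
{"id1": 78,"name1": "xxx","timestamp": 1532609005}
]"""


def parses_as_json_lines(text):
    """True only if every non-empty line is a complete JSON object on its own."""
    try:
        return all(
            isinstance(json.loads(line), dict)
            for line in text.splitlines()
            if line.strip()
        )
    except json.JSONDecodeError:
        return False


# a) The file is one multi-line JSON array, not one object per line.
print(parses_as_json_lines(data1))

# Parsing the whole document at once works.
records = json.loads(data1)
print(len(records))

# c) Read as epoch seconds, the newest event dates from July 2018.
latest = max(r["timestamp"] for r in records)
latest_dt = datetime.fromtimestamp(latest, tz=timezone.utc)
print(latest_dt.isoformat())
```

A line-by-line reader fails on the very first line (`[`), which is why the comment points at the multi-line option of the JSON reader.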




















python apache-spark pyspark spark-structured-streaming
















asked Nov 24 '18 at 21:14









Mozimaki
