Enforcing an interface on a DataFrame
I'm new to Spark and was wondering if the following is possible.

I have 2 Datasets, and they both have the fields EventTime and UserId. However, they differ in all other columns.

I want to write a function that takes in these Datasets and spits out the last time I saw each user. This is easy enough, because we can select the row with the maximum time for each user (groupBy).

Let's say I have a function LastSeenTime(events: DataFrame): DataFrame { ... }

My question is: how would you organize the code, and potentially define a type/interface, such that LastSeenTime can enforce that events has the UserId and EventTime columns it needs to do the processing? Can Dataset schemas conform to partial interfaces?

Thanks!
scala apache-spark-dataset
asked Nov 14 '18 at 23:18 by Vishaal Kalwani
1 Answer
You can do something like this:
sealed trait Event {
def userId: String
def eventTime: String
}
final case class UserEvent(userId: String, eventTime: String, otherField: String) extends Event
def lastTimeByUser[T <: Event](events: Dataset[T]): Dataset[T] = ???
Edit
If you're using a Dataframe, you can "cast" it to a Dataset[T] using the .as[T] method (where T is the case class you want to use to represent your data; it must have the same fields as your Rows).

Note that you will need an implicit Encoder[T] in scope for that; the simplest way to provide one is import spark.implicits._, where spark is an instance of SparkSession.
This looks like just what I need. Can you explain what the syntax Dataset[Event] does? Is it some sort of templating? – Vishaal Kalwani Nov 15 '18 at 0:23
@VishaalKalwani Yes, it's just normal Scala generics. It means the method accepts a Dataset of Events as input. – Luis Miguel Mejía Suárez Nov 15 '18 at 0:40
To elaborate on this: Dataframe is a type alias for Dataset[Row], and Row is just a glorified Map[String, Any]; Dataframe basically just has a bunch of additional syntax tacked onto it. Personally I'd recommend def lastTimeByUser[T <: Event](events: Dataset[T]): Dataset[T] = ??? because that will not lose you type information when calling the function with a Dataset[UserEvent]. – Dominic Egger Nov 15 '18 at 8:17
@DominicEgger, yes, you're right; I have updated the answer. I also added information about transforming a Dataframe into a Dataset for clarification. Please feel free to edit it if you believe you can be more concise than me. – Luis Miguel Mejía Suárez Nov 15 '18 at 14:57
answered Nov 15 '18 at 0:16 by Luis Miguel Mejía Suárez, edited Nov 15 '18 at 14:54