Enforcing an interface on a DataFrame
I'm new to Spark and was wondering if the following is possible.
I have 2 Datasets, and they both have fields EventTime and UserId. However, they differ in all other columns.
I want to write a function that takes in these Datasets and spits out the last time I saw each user.
This is easy enough, because we can select the row with the maximum time for each user (groupby)
Let's say I have a function LastSeenTime(events: DataFrame): DataFrame { ... }
My question is how would you organize the code, and potentially define a type/interface such that LastSeenTime can enforce that events has the UserId and EventTime columns it needs to do the processing.
Can Dataset Schema's conform to partial interfaces?
Thanks!
scala apache-spark-dataset
add a comment |
I'm new to Spark and was wondering if the following is possible.
I have 2 Datasets, and they both have fields EventTime and UserId. However, they differ in all other columns.
I want to write a function that takes in these Datasets and spits out the last time I saw each user.
This is easy enough, because we can select the row with the maximum time for each user (groupby)
Let's say I have a function LastSeenTime(events: DataFrame): DataFrame { ... }
My question is how would you organize the code, and potentially define a type/interface such that LastSeenTime can enforce that events has the UserId and EventTime columns it needs to do the processing.
Can Dataset Schema's conform to partial interfaces?
Thanks!
scala apache-spark-dataset
add a comment |
I'm new to Spark and was wondering if the following is possible.
I have 2 Datasets, and they both have fields EventTime and UserId. However, they differ in all other columns.
I want to write a function that takes in these Datasets and spits out the last time I saw each user.
This is easy enough, because we can select the row with the maximum time for each user (groupby)
Let's say I have a function LastSeenTime(events: DataFrame): DataFrame { ... }
My question is how would you organize the code, and potentially define a type/interface such that LastSeenTime can enforce that events has the UserId and EventTime columns it needs to do the processing.
Can Dataset Schema's conform to partial interfaces?
Thanks!
scala apache-spark-dataset
I'm new to Spark and was wondering if the following is possible.
I have 2 Datasets, and they both have fields EventTime and UserId. However, they differ in all other columns.
I want to write a function that takes in these Datasets and spits out the last time I saw each user.
This is easy enough, because we can select the row with the maximum time for each user (groupby)
Let's say I have a function LastSeenTime(events: DataFrame): DataFrame { ... }
My question is how would you organize the code, and potentially define a type/interface such that LastSeenTime can enforce that events has the UserId and EventTime columns it needs to do the processing.
Can Dataset Schema's conform to partial interfaces?
Thanks!
scala apache-spark-dataset
scala apache-spark-dataset
asked Nov 14 '18 at 23:18
Vishaal KalwaniVishaal Kalwani
3821313
3821313
add a comment |
add a comment |
1 Answer
1
active
oldest
votes
You can make something like this:
sealed trait Event {
def userId: String
def eventTime: String
}
final case class UserEvent(userId: String, eventTime: String, otherField: String) extends Event
def lastTimeByUser[E <: Event, T](events: Dataset[E]): Dataset[T] = ???
Edit
If you're using a Dataframe, you can "cast" it to a Dataset[T] using the .as[T] method. (Where T is the case class you want to use for represent your data - must have the same fields of your Rows).
Note, you will need a implicit Encoder[T] in the scope for that - the simplest way to provide it is import spark.implicits._, where spark is an instance of SparkSession.
This looks like just what I need. Can you explain what the syntaxDataset[Event]does? Is it some sort of templating ?
– Vishaal Kalwani
Nov 15 '18 at 0:23
@VishaalKalwani Yes, is just a normal Scala generics. It means is the method accepts as input a Dataset of Events.
– Luis Miguel Mejía Suárez
Nov 15 '18 at 0:40
1
To elaborate on this.Dataframeis a type alias forDataset[Row]andRowis just a glorifiedMap[String, Any]andDataframebasically just has a bunch of additional syntax tacked to it. Personally I'd recommentdef lastTimeByUser[T <: Event](events: Dataset[T]): Dataset[T] = ???because that will not lose you type information when calling the function with aDataset[UserEvent]
– Dominic Egger
Nov 15 '18 at 8:17
@DominicEgger, yes you're right - I have updated the answer. I also provided information about transforming aDataframeinto aDatasetfor clarification. Please feel free to edit it if you believe you can be more consince than me.
– Luis Miguel Mejía Suárez
Nov 15 '18 at 14:57
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53310215%2fenforcing-an-interface-on-a-dataframe%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
You can make something like this:
sealed trait Event {
def userId: String
def eventTime: String
}
final case class UserEvent(userId: String, eventTime: String, otherField: String) extends Event
def lastTimeByUser[E <: Event, T](events: Dataset[E]): Dataset[T] = ???
Edit
If you're using a Dataframe, you can "cast" it to a Dataset[T] using the .as[T] method. (Where T is the case class you want to use for represent your data - must have the same fields of your Rows).
Note, you will need a implicit Encoder[T] in the scope for that - the simplest way to provide it is import spark.implicits._, where spark is an instance of SparkSession.
This looks like just what I need. Can you explain what the syntaxDataset[Event]does? Is it some sort of templating ?
– Vishaal Kalwani
Nov 15 '18 at 0:23
@VishaalKalwani Yes, is just a normal Scala generics. It means is the method accepts as input a Dataset of Events.
– Luis Miguel Mejía Suárez
Nov 15 '18 at 0:40
1
To elaborate on this.Dataframeis a type alias forDataset[Row]andRowis just a glorifiedMap[String, Any]andDataframebasically just has a bunch of additional syntax tacked to it. Personally I'd recommentdef lastTimeByUser[T <: Event](events: Dataset[T]): Dataset[T] = ???because that will not lose you type information when calling the function with aDataset[UserEvent]
– Dominic Egger
Nov 15 '18 at 8:17
@DominicEgger, yes you're right - I have updated the answer. I also provided information about transforming aDataframeinto aDatasetfor clarification. Please feel free to edit it if you believe you can be more consince than me.
– Luis Miguel Mejía Suárez
Nov 15 '18 at 14:57
add a comment |
You can make something like this:
sealed trait Event {
def userId: String
def eventTime: String
}
final case class UserEvent(userId: String, eventTime: String, otherField: String) extends Event
def lastTimeByUser[E <: Event, T](events: Dataset[E]): Dataset[T] = ???
Edit
If you're using a Dataframe, you can "cast" it to a Dataset[T] using the .as[T] method. (Where T is the case class you want to use for represent your data - must have the same fields of your Rows).
Note, you will need a implicit Encoder[T] in the scope for that - the simplest way to provide it is import spark.implicits._, where spark is an instance of SparkSession.
This looks like just what I need. Can you explain what the syntaxDataset[Event]does? Is it some sort of templating ?
– Vishaal Kalwani
Nov 15 '18 at 0:23
@VishaalKalwani Yes, is just a normal Scala generics. It means is the method accepts as input a Dataset of Events.
– Luis Miguel Mejía Suárez
Nov 15 '18 at 0:40
1
To elaborate on this.Dataframeis a type alias forDataset[Row]andRowis just a glorifiedMap[String, Any]andDataframebasically just has a bunch of additional syntax tacked to it. Personally I'd recommentdef lastTimeByUser[T <: Event](events: Dataset[T]): Dataset[T] = ???because that will not lose you type information when calling the function with aDataset[UserEvent]
– Dominic Egger
Nov 15 '18 at 8:17
@DominicEgger, yes you're right - I have updated the answer. I also provided information about transforming aDataframeinto aDatasetfor clarification. Please feel free to edit it if you believe you can be more consince than me.
– Luis Miguel Mejía Suárez
Nov 15 '18 at 14:57
add a comment |
You can make something like this:
sealed trait Event {
def userId: String
def eventTime: String
}
final case class UserEvent(userId: String, eventTime: String, otherField: String) extends Event
def lastTimeByUser[E <: Event, T](events: Dataset[E]): Dataset[T] = ???
Edit
If you're using a Dataframe, you can "cast" it to a Dataset[T] using the .as[T] method. (Where T is the case class you want to use for represent your data - must have the same fields of your Rows).
Note, you will need a implicit Encoder[T] in the scope for that - the simplest way to provide it is import spark.implicits._, where spark is an instance of SparkSession.
You can make something like this:
sealed trait Event {
def userId: String
def eventTime: String
}
final case class UserEvent(userId: String, eventTime: String, otherField: String) extends Event
def lastTimeByUser[E <: Event, T](events: Dataset[E]): Dataset[T] = ???
Edit
If you're using a Dataframe, you can "cast" it to a Dataset[T] using the .as[T] method. (Where T is the case class you want to use for represent your data - must have the same fields of your Rows).
Note, you will need a implicit Encoder[T] in the scope for that - the simplest way to provide it is import spark.implicits._, where spark is an instance of SparkSession.
edited Nov 15 '18 at 14:54
answered Nov 15 '18 at 0:16
Luis Miguel Mejía SuárezLuis Miguel Mejía Suárez
2,6361822
2,6361822
This looks like just what I need. Can you explain what the syntaxDataset[Event]does? Is it some sort of templating ?
– Vishaal Kalwani
Nov 15 '18 at 0:23
@VishaalKalwani Yes, is just a normal Scala generics. It means is the method accepts as input a Dataset of Events.
– Luis Miguel Mejía Suárez
Nov 15 '18 at 0:40
1
To elaborate on this.Dataframeis a type alias forDataset[Row]andRowis just a glorifiedMap[String, Any]andDataframebasically just has a bunch of additional syntax tacked to it. Personally I'd recommentdef lastTimeByUser[T <: Event](events: Dataset[T]): Dataset[T] = ???because that will not lose you type information when calling the function with aDataset[UserEvent]
– Dominic Egger
Nov 15 '18 at 8:17
@DominicEgger, yes you're right - I have updated the answer. I also provided information about transforming aDataframeinto aDatasetfor clarification. Please feel free to edit it if you believe you can be more consince than me.
– Luis Miguel Mejía Suárez
Nov 15 '18 at 14:57
add a comment |
This looks like just what I need. Can you explain what the syntaxDataset[Event]does? Is it some sort of templating ?
– Vishaal Kalwani
Nov 15 '18 at 0:23
@VishaalKalwani Yes, is just a normal Scala generics. It means is the method accepts as input a Dataset of Events.
– Luis Miguel Mejía Suárez
Nov 15 '18 at 0:40
1
To elaborate on this.Dataframeis a type alias forDataset[Row]andRowis just a glorifiedMap[String, Any]andDataframebasically just has a bunch of additional syntax tacked to it. Personally I'd recommentdef lastTimeByUser[T <: Event](events: Dataset[T]): Dataset[T] = ???because that will not lose you type information when calling the function with aDataset[UserEvent]
– Dominic Egger
Nov 15 '18 at 8:17
@DominicEgger, yes you're right - I have updated the answer. I also provided information about transforming aDataframeinto aDatasetfor clarification. Please feel free to edit it if you believe you can be more consince than me.
– Luis Miguel Mejía Suárez
Nov 15 '18 at 14:57
This looks like just what I need. Can you explain what the syntax
Dataset[Event] does? Is it some sort of templating ?– Vishaal Kalwani
Nov 15 '18 at 0:23
This looks like just what I need. Can you explain what the syntax
Dataset[Event] does? Is it some sort of templating ?– Vishaal Kalwani
Nov 15 '18 at 0:23
@VishaalKalwani Yes, is just a normal Scala generics. It means is the method accepts as input a Dataset of Events.
– Luis Miguel Mejía Suárez
Nov 15 '18 at 0:40
@VishaalKalwani Yes, is just a normal Scala generics. It means is the method accepts as input a Dataset of Events.
– Luis Miguel Mejía Suárez
Nov 15 '18 at 0:40
1
1
To elaborate on this.
Dataframe is a type alias for Dataset[Row] and Row is just a glorified Map[String, Any] and Dataframe basically just has a bunch of additional syntax tacked to it. Personally I'd recomment def lastTimeByUser[T <: Event](events: Dataset[T]): Dataset[T] = ??? because that will not lose you type information when calling the function with a Dataset[UserEvent]– Dominic Egger
Nov 15 '18 at 8:17
To elaborate on this.
Dataframe is a type alias for Dataset[Row] and Row is just a glorified Map[String, Any] and Dataframe basically just has a bunch of additional syntax tacked to it. Personally I'd recomment def lastTimeByUser[T <: Event](events: Dataset[T]): Dataset[T] = ??? because that will not lose you type information when calling the function with a Dataset[UserEvent]– Dominic Egger
Nov 15 '18 at 8:17
@DominicEgger, yes you're right - I have updated the answer. I also provided information about transforming a
Dataframe into a Dataset for clarification. Please feel free to edit it if you believe you can be more consince than me.– Luis Miguel Mejía Suárez
Nov 15 '18 at 14:57
@DominicEgger, yes you're right - I have updated the answer. I also provided information about transforming a
Dataframe into a Dataset for clarification. Please feel free to edit it if you believe you can be more consince than me.– Luis Miguel Mejía Suárez
Nov 15 '18 at 14:57
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53310215%2fenforcing-an-interface-on-a-dataframe%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown