How can I iterate through a column of a Spark dataframe and access the values in it one by one?
I have a Spark dataframe. I would like to fetch the values of a column one by one and assign each value to a variable. How can this be done in PySpark? Sorry, I am a newbie to Spark as well as Stack Overflow; please forgive the lack of clarity in the question.
pyspark apache-spark-sql
asked Nov 13 '18 at 14:24
RAM SHANKER G
For which column do you want to do this?
– karma4917
Nov 13 '18 at 15:36
There are some fundamental misunderstandings here about how Spark dataframes work. Don't think about iterating through values one by one; instead, think about operating on all the values at the same time (after all, it's a parallel, distributed architecture). This seems like an XY problem. Please explain, in detail, what you are trying to do, and try to edit your question to provide a reproducible example.
– pault
Nov 13 '18 at 15:38
Also, don't post pictures of or links to code/data.
– pault
Nov 13 '18 at 15:41
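To illustrate the column-at-a-time style that pault describes, here is a minimal sketch; the example dataframe, the column name 'amount', and the doubling transformation are all hypothetical, not taken from the question:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical example data; the question never showed its dataframe.
df = spark.createDataFrame([(1,), (2,), (3,)], ['amount'])

# Transform every value of the column in one distributed operation,
# instead of looping over the values one by one on the driver.
df = df.withColumn('amount_doubled', F.col('amount') * 2)
df.show()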
2 Answers
I don't understand exactly what you are asking, but if you want to store the values in a variable outside of the dataframes that Spark offers, the best option is to select the column you want and bring it to the driver as a pandas DataFrame (only if there are not too many values, because the driver's memory is limited).
from pyspark.sql import functions as F

# Pull the selected column to the driver as a single-column pandas DataFrame
var = df.select(F.col('column_you_want')).toPandas()
Then you can iterate over the resulting column like a normal pandas Series.
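For example, a minimal sketch of that iteration, assuming the placeholder column name 'column_you_want' from the snippet above:

# 'column_you_want' is a placeholder; substitute the real column name.
for value in var['column_you_want']:
    some_variable = value  # each value lands in a plain Python variable
    print(some_variable)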
answered Nov 13 '18 at 15:14
Manrique

No, I need to access one value in each iteration and store it in a variable. I don't want to use toPandas as it consumes more memory!
– RAM SHANKER G
Nov 13 '18 at 15:16
# Collect the column to the driver as a list of Row objects
col1 = df.select(df.column_of_df).collect()
# Pull the value out of each Row
list1 = [str(i[0]) for i in col1]
# After this we can iterate through the list (list1 in this case)
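For instance, a short usage sketch (df and column_of_df are the answer's placeholders), binding each collected value to a variable one at a time, as the question asks:

for i, value in enumerate(list1):
    some_variable = value  # the i-th value, now a plain Python string
    print(i, some_variable)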
answered Nov 13 '18 at 16:25
Avinash

What if there are a large number of rows? The collect() operation will be costly, right?
– karma4917
Nov 13 '18 at 16:40
Repartition the dataframe to the same number of nodes the instance is running on before using collect(), to reduce time and memory costs.
– Avinash
Nov 13 '18 at 17:58
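A minimal sketch of that suggestion; the partition count of 4 is purely an assumption standing in for the real node count:

num_nodes = 4  # hypothetical: match this to the number of worker nodes
col1 = df.select(df.column_of_df).repartition(num_nodes).collect()
list1 = [str(i[0]) for i in col1]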