How can I iterate through a column of a Spark dataframe and access the values in it one by one?
I have a Spark dataframe; here it is.

I would like to fetch the values of a column one by one and assign each value to a variable. How can this be done in PySpark? Sorry, I am a newbie to Spark as well as Stack Overflow; please forgive the lack of clarity in the question.

pyspark apache-spark-sql

asked Nov 13 '18 at 14:24
RAM SHANKER G
  • For which column do you want to do this?

    – karma4917
    Nov 13 '18 at 15:36

  • There are some fundamental misunderstandings here about how Spark dataframes work. Don't think about iterating through the values one by one; instead, think about operating on all of the values at the same time (after all, it's a parallel, distributed architecture); a sketch of that column-wise style follows these comments. This seems like an XY problem. Please explain, in detail, what you are trying to do, and edit your question to provide a reproducible example.

    – pault
    Nov 13 '18 at 15:38

  • Also, don't post pictures of or links to code/data.

    – pault
    Nov 13 '18 at 15:41
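
To illustrate pault's point, here is a minimal sketch of the column-wise style he describes, operating on all values at once instead of looping. The dataframe df and the column name some_column are hypothetical stand-ins:

from pyspark.sql import functions as F

# Express the work as a column expression rather than a Python loop;
# Spark applies it to every row in parallel across the cluster.
df2 = df.withColumn('some_column_doubled', F.col('some_column') * 2)

# Filters follow the same pattern: one expression over the whole column.
df3 = df2.filter(F.col('some_column') > 0)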
















2 Answers
I don't understand exactly what you are asking, but if you want to store the values in a variable outside of the dataframes that Spark offers, the best option is to select the column you want and store it as a pandas Series (only if there are not a lot of values, because your memory is limited).

from pyspark.sql import functions as F

var = df.select(F.col('column_you_want')).toPandas()

Then you can iterate over it like a normal pandas Series.
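
For example, a minimal usage sketch: note that toPandas() on a one-column select actually returns a one-column pandas DataFrame, so pick the column out before iterating ('column_you_want' is the placeholder name from the answer):

for value in var['column_you_want']:
    # value is now a plain Python object; assign or process it as needed
    print(value)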






answered Nov 13 '18 at 15:14
Manrique
  • No, I need to access one value in each iteration and store it in a variable. I don't want to use toPandas as it consumes more memory!

    – RAM SHANKER G
    Nov 13 '18 at 15:16
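
An editorial aside: if the goal really is one value per loop iteration without toPandas() or a full collect(), PySpark's DataFrame.toLocalIterator() returns an iterator over the rows, and the driver only needs about as much memory as the largest partition. A minimal sketch, with placeholder names as above:

for row in df.select('column_you_want').toLocalIterator():
    value = row[0]  # one value per iteration, assigned to a variable
    # ... use value here ...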
You can collect the column to the driver and build a plain Python list from its values:

col1 = df.select(df.column_of_df).collect()   # a list of Row objects
list1 = [str(i[0]) for i in col1]             # extract each value as a string
# after this we can iterate through the list (list1 in this case)
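
A usage sketch for the list built above, one value per loop iteration as the question asked:

for value in list1:
    # each value is one entry of the column, as a string
    print(value)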





answered Nov 13 '18 at 16:25
Avinash
  • What if there are a large number of rows? The collect() operation will be costly, right?

    – karma4917
    Nov 13 '18 at 16:40

  • Repartition the dataframe to the same number of nodes the instance is running on before using collect(), to reduce time and memory costs.

    – Avinash
    Nov 13 '18 at 17:58
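
A minimal sketch of that repartition-then-collect pattern, as an editorial aside. Here defaultParallelism stands in for the node count, an existing SparkSession named spark is assumed, and note that collect() still brings every value to the driver:

n = spark.sparkContext.defaultParallelism
col1 = df.repartition(n).select(df.column_of_df).collect()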










