Pandas: Drop Columns to maximize rows without NA



























I have a dataset with many columns and a meaningful number of rows where at least one column is NA. I would like to do something like df.dropna(), but column-wise and with the objective of maximizing the number of rows with no NA.



A bit of background on the dataset: many of the columns can be considered part of the 'core' dataset, and these core columns will almost always be without NA. The other (non-core) columns are less complete and vary in their completeness. In the non-core columns there is no pattern to whether a value is missing: a row missing data in one column is not more likely to be missing data in another column.
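If missingness really is independent across the non-core columns, the expected share of complete rows for a set of columns is roughly the product of each column's completeness, so adding columns from most to least complete is a reasonable shortcut. The following is a minimal sketch under that assumption; the function name `greedy_completeness_curve` and the toy column names are hypothetical, not part of the original dataset.

```python
import numpy as np
import pandas as pd


def greedy_completeness_curve(df, core_cols, noncore_cols):
    """Add non-core columns from most to least complete, tracking complete rows."""
    # Fraction of non-NA values per non-core column, most complete first.
    order = (
        df[noncore_cols].notna().mean()
        .sort_values(ascending=False, kind="stable")
        .index
    )
    # Rows that are complete across the core columns only.
    ok = df[core_cols].notna().all(axis=1)
    rows = [{"added": None, "n_noncore": 0, "complete_rows": int(ok.sum())}]
    for i, col in enumerate(order, start=1):
        ok = ok & df[col].notna()
        rows.append({"added": col, "n_noncore": i, "complete_rows": int(ok.sum())})
    return pd.DataFrame(rows)


# Toy data (hypothetical): "id" is core, "a", "b", "c" are non-core.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "a": [1.0, np.nan, 3.0, 4.0, 5.0],
    "b": [np.nan, np.nan, 1.0, 1.0, 1.0],
    "c": [1.0, 1.0, np.nan, np.nan, np.nan],
})
curve = greedy_completeness_curve(toy, ["id"], ["a", "b", "c"])
print(curve)
```

Each row of `curve` shows how many complete rows remain after one more column is kept, which makes the trade-off between columns and rows visible without enumerating every subset.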



Is there an established way to go about this? If not . . .



I was thinking it would be possible to:




  1. Extract names of the core columns into a list

  2. Extract the names of the non-core columns into another list

  3. Use itertools.combinations (https://docs.python.org/3/library/itertools.html#itertools.combinations)
    to generate all possible combinations of the non-core columns (each column by itself, then all combos of 2, all combos of 3, etc.).

  4. For each combination, subset the dataframe to the core columns plus the non-core columns in that combination, and count the rows without NA.

  5. Log all of these and pick the one that yielded the most rows.
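The steps above can be sketched roughly as follows. This is only an illustration: the function name `complete_rows_by_combo` and the toy column names are hypothetical. Note that adding a column can never increase the number of complete rows, so the overall maximum is always the core-only subset; the tally is mainly useful for comparing subsets of the same size. With n non-core columns there are 2^n subsets, so this is only practical for a small n.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def complete_rows_by_combo(df, core_cols, noncore_cols):
    """Count rows with no NA for the core columns plus every non-core subset."""
    core_ok = df[core_cols].notna().all(axis=1)
    records = []
    for k in range(len(noncore_cols) + 1):
        for combo in combinations(noncore_cols, k):
            if combo:
                ok = core_ok & df[list(combo)].notna().all(axis=1)
            else:
                ok = core_ok  # core columns only
            records.append(
                {"kept": combo, "n_kept": k, "complete_rows": int(ok.sum())}
            )
    return pd.DataFrame(records)


# Toy data (hypothetical): "id" is core, "a", "b", "c" are non-core.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "a": [1.0, np.nan, 3.0, 4.0, 5.0],
    "b": [np.nan, np.nan, 1.0, 1.0, 1.0],
    "c": [1.0, 1.0, 1.0, np.nan, 1.0],
})
tally = complete_rows_by_combo(df, ["id"], ["a", "b", "c"])

# Best subset for each number of non-core columns kept.
best_per_size = tally.loc[tally.groupby("n_kept")["complete_rows"].idxmax()]
print(best_per_size)
```

Picking a row from `best_per_size` then comes down to how many rows you are willing to give up for each extra column.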


Does anyone have any experience doing something like this?



Thank you

































  • You can use df.dropna() and pass it axis=1 for a column-wise drop, and thresh=int(.75*len(df)) if, for example, you want to keep only the columns that have at least that many non-NA values

    – G. Anderson
    Nov 13 '18 at 23:23
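
A small illustration of the threshold idea from the comment, assuming a toy frame with hypothetical column names. In `DataFrame.dropna`, `thresh` is the minimum number of non-NA values a column needs in order to be kept, so 75% of the row count keeps columns that are at least 75% complete.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "core": [1, 2, 3, 4],
    "mostly_full": [1.0, 2.0, 3.0, np.nan],   # 3 of 4 values present
    "sparse": [np.nan, np.nan, np.nan, 1.0],  # 1 of 4 values present
})

# Keep columns with at least 75% non-NA values (thresh expects an integer count).
min_non_na = int(0.75 * len(frame))
kept = frame.dropna(axis=1, thresh=min_non_na)

# Then drop the remaining rows that still contain an NA.
complete = kept.dropna()
print(kept.columns.tolist(), len(complete))
```

This does not search for an optimal subset; it only applies a single completeness cutoff, which may be enough when the non-core columns differ clearly in how complete they are.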























python pandas
















asked Nov 13 '18 at 23:19









Kelley Brady

























