Pandas: Drop Columns to maximize rows without NA

I have a dataset with many columns and a meaningful number of rows where a column is NA. I would like to do something like df.dropna(), but column-wise and with the objective of maximizing the number of rows with no NA.

A bit of background on the dataset . . . many of the columns can be considered part of the 'core' dataset, and these core columns will almost always be without NA. The other columns (non-core) are less complete, and vary in their completeness. In the non-core columns there is no pattern to whether a value is missing: a row missing data in one column is not more likely to be missing data in another column.

Is there an established way to go about this? If not . . .

I was thinking it would be possible to:

Extract names of the core columns into a list

Extract the names of the non-core columns into another list

Use itertools.combinations (https://docs.python.org/3/library/itertools.html#itertools.combinations) to generate all possible combinations of the non-core columns (each column by itself, then all combos of 2, all combos of 3, etc.).

For each combination, subset the dataframe to the core columns plus the current combination of non-core columns, and count the rows without NA.

Log all of these and pick the one that yielded the most rows.
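
The steps above could be sketched roughly as follows. This is only a minimal illustration under assumptions: the function name best_subset_per_size, the toy dataframe, and its column names are hypothetical. One caveat is that adding a column can only keep or lower the number of complete rows, so the raw maximum is always the core-only subset; the sketch therefore records the best combination for each number of non-core columns kept, so the trade-off can be inspected. The number of combinations also grows as 2**n in the number of non-core columns, so this is only practical for a modest n.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def best_subset_per_size(df, core_cols, noncore_cols):
    """For each number of non-core columns kept, find the combination
    that leaves the most rows without NA (hypothetical helper)."""
    best = {}
    for size in range(len(noncore_cols) + 1):
        for combo in combinations(noncore_cols, size):
            cols = list(core_cols) + list(combo)
            # Count the rows with no NA in the selected columns.
            n_complete = int(df[cols].notna().all(axis=1).sum())
            # Keep the first combination reaching the highest count for this size.
            if size not in best or n_complete > best[size][1]:
                best[size] = (combo, n_complete)
    return best


# Toy data (made up): "id" is the core column; "a", "b", "c" are non-core.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, 3.0, 4.0],
    "c": [1.0, 2.0, 3.0, np.nan],
})

best = best_subset_per_size(df, ["id"], ["a", "b", "c"])
for size, (combo, n_complete) in best.items():
    print(size, combo, n_complete)
```

From the result, one would pick the largest set of non-core columns whose complete-row count is still acceptable, e.g. df[["id", *best[2][0]]].dropna().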

Does anyone have any experience doing something like this?

Thank you

asked Nov 13 '18 at 23:19

Kelley Brady

725

You can use df.dropna() and pass it axis=1 for a column-wise drop, and thresh=.75*len(df) if, for example, you want to keep only the columns that have at least that many non-NA values

– G. Anderson
Nov 13 '18 at 23:23
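
A small sketch of what this comment suggests, assuming a made-up dataframe and column names. In DataFrame.dropna, thresh is the minimum number of non-NA values a column (with axis=1) must have to be kept, so 0.75 * len(df) keeps columns that are at least 75% populated; the value is converted to an int here since thresh is documented as an integer.

```python
import numpy as np
import pandas as pd

# Toy data (made up): one complete column and two partially filled ones.
df = pd.DataFrame({
    "core": [1, 2, 3, 4],
    "mostly_full": [1.0, 2.0, 3.0, np.nan],         # 3 of 4 values present
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],  # 1 of 4 values present
})

# Minimum number of non-NA values a column needs in order to be kept.
min_non_na = int(0.75 * len(df))

# Drop the columns that fall below the threshold ...
trimmed = df.dropna(axis=1, thresh=min_non_na)

# ... then drop the rows that still contain an NA.
complete = trimmed.dropna()

print(list(trimmed.columns))
print(len(complete))
```

Note that this picks columns by their individual completeness rather than searching over combinations, so it is a heuristic and not an exhaustive search.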

python pandas
