Pandas: Drop Columns to maximize rows without NA
I have a dataset with many columns and a meaningful number of rows where at least one column is NA. I would like to do something like df.dropna(), but column-wise and with the objective of maximizing the number of rows with no NA.
A bit of background on the dataset: many of the columns can be considered part of the 'core' dataset, and these core columns will almost always be without NA. The other columns (non-core) are less complete, and vary in their completeness. In the non-core columns there is no pattern to whether a value is missing: a row missing data in one column is not more likely to be missing data in another column.
Is there an established way to go about this? If not . . .
I was thinking it would be possible to:
- Extract names of the core columns into a list
- Extract the names of the non-core columns into another list
- Use itertools.combinations (https://docs.python.org/3/library/itertools.html#itertools.combinations) to generate all possible combinations of the non-core columns (each column by itself, then all combos of 2, all combos of 3, etc.).
- For each combination, subset the dataframe to the core columns plus the current combination of non-core columns, and count the rows without NA.
- Log all of these and pick the one that yielded the most rows.
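The steps above could be sketched roughly as follows. This is only a minimal illustration under stated assumptions: the toy dataframe, the column names (`id`, `value`, `a`, `b`, `c`) and the helper `complete_row_counts` are all hypothetical. Because adding a column can never increase the number of NA-free rows, the sketch reports the best combination for each number of kept non-core columns rather than a single overall winner.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Hypothetical toy data: "id" and "value" stand in for core columns,
# "a", "b", "c" for the less complete non-core columns.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "value": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "a": [1.0, np.nan, 3.0, 4.0, 5.0, 6.0],
    "b": [np.nan, np.nan, 1.0, 1.0, np.nan, 1.0],
    "c": [7.0, 8.0, 9.0, np.nan, 7.0, 8.0],
})
core_cols = ["id", "value"]
non_core_cols = [c for c in df.columns if c not in core_cols]


def complete_row_counts(frame, core, non_core):
    """Count NA-free rows for every combination of non-core columns."""
    records = []
    for size in range(len(non_core) + 1):
        for combo in combinations(non_core, size):
            cols = core + list(combo)
            # A row counts only if every selected column is non-NA.
            n_rows = int(frame[cols].notna().all(axis=1).sum())
            records.append(
                {"kept": combo, "n_kept": size, "complete_rows": n_rows}
            )
    return pd.DataFrame(records)


results = complete_row_counts(df, core_cols, non_core_cols)

# Best combination for each number of kept non-core columns
# (ties resolve to the first combination generated).
best_per_size = results.loc[results.groupby("n_kept")["complete_rows"].idxmax()]
print(best_per_size)
```

Note that the number of combinations doubles with every non-core column (2^k for k columns), so an exhaustive search like this is only practical for a modest number of non-core columns.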
Does anyone have any experience doing something like this?
Thank you
python pandas
You can use df.dropna() and pass it axis=1 for a column-wise drop, and thresh=.75*len(df) if, for example, you want to drop any column that has fewer than that many non-NA values
– G. Anderson
Nov 13 '18 at 23:23
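A small sketch of what the comment describes, assuming a hypothetical toy frame with columns `core`, `mostly` and `sparse`. In `DataFrame.dropna`, `thresh` is the minimum number of non-NA values a column must have to be kept, so a 75% threshold keeps columns that are at least 75% filled. The value is cast to `int` here because `thresh` is documented as an integer.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: "core" is complete, "mostly" is 75% filled,
# "sparse" is 25% filled.
df = pd.DataFrame({
    "core": [1, 2, 3, 4],
    "mostly": [1.0, 2.0, 3.0, np.nan],
    "sparse": [np.nan, np.nan, np.nan, 1.0],
})

# Keep only columns with at least this many non-NA values.
min_non_na = int(0.75 * len(df))
trimmed = df.dropna(axis=1, thresh=min_non_na)

# Then drop the rows that still contain NA in the remaining columns.
complete = trimmed.dropna()
print(list(trimmed.columns), len(complete))
```

This applies one threshold to each column independently; it does not search over combinations of columns.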
asked Nov 13 '18 at 23:19
Kelley Brady