R: how to debug “factor has new levels” error for linear model and prediction
I am trying to make and test a linear model as follows:



lm_model <- lm(Purchase ~., data = train)
lm_prediction <- predict(lm_model, test)


This results in the following error, stating that the Product_Category_1 column has values that exist in the test data frame but not the train data frame:




factor Product_Category_1 has new levels 7, 9, 14, 16, 17, 18




However, if I check these levels, they clearly appear in both data frames:



> nrow(subset(train, Product_Category_1 == "7"))
[1] 2923
> nrow(subset(test, Product_Category_1 == "7"))
[1] 745
> nrow(subset(train, Product_Category_1 == "9"))
[1] 312
> nrow(subset(test, Product_Category_1 == "9"))
[1] 92


Tabulating the column for train and test also shows they have the same levels:



> table(train$Product_Category_1)

     1     2     3    4      5     6    7     8   9   10    11   12   13   14   15   16  17   18
110820 18818 15820 9265 118955 16159 2923 89511 312 4030 19113 3108 4407 1201 4991 7730 467 2430
> table(test$Product_Category_1)

    1    2    3    4     5    6   7     8  9   10   11  12   13  14   15   16  17  18
27533 4681 4029 2301 29637 4005 745 22621 92 1002 4847 767 1033 299 1212 1967 100 645
  • What do table(train$Product_Category_1) and table(test$Product_Category_1) tell you? There is not much we can do to help without a reproducible example here.
    – Roland
    Jul 27 at 6:53

  • table(train$Product_Category_1) and table(test$Product_Category_1) show they have the same factors (I have edited the post).
    – ZhouW
    Jul 27 at 7:03

  • We still need a reproducible example. Read this: stackoverflow.com/questions/5963269/…
    – dww
    Jul 27 at 8:13

  • One possibility is NA values in other variables. lm applies na.omit and could thereby remove all observations of a specific factor level.
    – Roland
    Jul 27 at 8:48
Tags: r, regression, linear-regression, prediction, lm
asked Jul 27 at 6:46 by ZhouW; edited Jul 27 at 11:43 by 李哲源

1 Answer
Table of Contents:




  • A simple example for walkthrough

  • Suggestion for users

  • Helpful information that we can get from the fitted model object

  • OK, I see what the problem is now, but how to make predict work?

  • Is there a better way to avoid such problem at all?




A simple example for walkthrough



Here is a reproducible example, simple enough to show what has happened.



train <- data.frame(y = runif(4), x = c(runif(3), NA), f = factor(letters[1:4]))
test <- data.frame(y = runif(4), x = runif(4), f = factor(letters[1:4]))
fit <- lm(y ~ x + f, data = train)
predict(fit, newdata = test)
#Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
# factor f has new levels d


I am fitting a model with more parameters than data points, so the model is rank-deficient (explained at the end). This does not, however, affect how lm and predict work here.



Merely checking table(train$f) and table(test$f) is not useful, because the problem is not caused by the variable f itself but by NA in x. For model fitting, lm and glm drop incomplete cases, i.e., rows with at least one NA (see ?complete.cases). They have to, as the underlying FORTRAN routine for QR factorization cannot handle NA. If you check the documentation at ?lm, you will see that the function has an na.action argument, which defaults to na.omit. You can also set it to na.exclude, but na.pass, which retains NA, will cause a FORTRAN error:



fit <- lm(y ~ x + f, data = train, na.action = na.pass)
#Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
# NA/NaN/Inf in 'x'


Let's remove NA from the training dataset.



train <- na.omit(train)
train$f
#[1] a b c
#Levels: a b c d


f now has an unused level "d". lm and glm will drop unused levels when building the model frame (and later the model matrix):



## source code of lm; don't run
mf$drop.unused.levels <- TRUE
mf[[1L]] <- quote(stats::model.frame)
mf <- eval(mf, parent.frame())


This is not user-controllable. The reason is that if an unused level were retained, it would generate a column of zeros in the model matrix.



mf <- model.frame(y ~ x + f, data = train, drop.unused.levels = FALSE)
model.matrix(y ~ x + f, data = mf)
# (Intercept) x fb fc fd
#1 1 0.90021178 0 0 0
#2 1 0.10188534 1 0 0
#3 1 0.05881954 0 1 0
#attr(,"assign")
#[1] 0 1 2 2 2
#attr(,"contrasts")
#attr(,"contrasts")$f
#[1] "contr.treatment"


This is undesirable, as it produces an NA coefficient for the dummy variable fd. With drop.unused.levels = TRUE, as forced by lm and glm:



mf <- model.frame(y ~ x + f, data = train, drop.unused.levels = TRUE)
model.matrix(y ~ x + f, data = mf)
# (Intercept) x fb fc
#1 1 0.90021178 0 0
#2 1 0.10188534 1 0
#3 1 0.05881954 0 1
#attr(,"assign")
#[1] 0 1 2 2
#attr(,"contrasts")
#attr(,"contrasts")$f
#[1] "contr.treatment"


The fd column is gone, and



mf$f
#[1] a b c
#Levels: a b c


The level "d", which no longer exists in the model frame, is what causes the "new factor level" error in predict.





Suggestion for users



It is highly recommended that all users do the following manually when fitting models:





  • [No. 1] remove incomplete cases;


  • [No. 2] drop unused factor levels.


This is exactly the procedure recommended in: How to debug "contrasts can be applied only to factors with 2 or more levels" error? It makes users aware of what lm and glm do under the hood, and makes their debugging life much easier.
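In base R, these two steps can be sketched as follows (the toy data frame below is hypothetical, mirroring the walkthrough example):

```r
# Pre-fitting cleanup sketch: first remove incomplete cases, then drop
# factor levels that no longer occur in the remaining rows.
train <- data.frame(y = runif(4),
                    x = c(runif(3), NA),        # row 4 is incomplete
                    f = factor(letters[1:4]))   # level "d" appears only in row 4

train <- na.omit(train)      # [No. 1] remove incomplete cases
train <- droplevels(train)   # [No. 2] drop unused factor levels

levels(train$f)
# [1] "a" "b" "c"   ("d" is gone along with its only row)
```

Doing this yourself, before lm ever sees the data, means the fitted model's xlevels will match what you believe the data contains.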



Note, there should be another recommendation in the list:





  • [No. 0] do subsetting yourself


Users may occasionally use the subset argument. But there is a potential pitfall: not all factor levels may appear in the subsetted dataset, so you may get "new factor levels" when using predict later.



The above advice is particularly important when you write functions that wrap lm or glm. You want your functions to be robust: have them return an informative error rather than waiting for lm and glm to complain.
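As a sketch of what such a wrapper could look like (the function name and error message are my own invention, not a standard API):

```r
# Defensive wrapper around lm: clean the data first, then fail early with an
# informative error if any factor ends up with fewer than two levels.
fit_lm_safely <- function(formula, data) {
  data <- droplevels(na.omit(data))
  factor_cols <- names(data)[vapply(data, is.factor, logical(1))]
  degenerate  <- factor_cols[vapply(data[factor_cols], nlevels, integer(1)) < 2L]
  if (length(degenerate) > 0L) {
    stop("factor(s) with fewer than 2 levels after cleanup: ",
         paste(degenerate, collapse = ", "))
  }
  lm(formula, data = data)
}
```

The user then sees which factor collapsed, instead of a cryptic "contrasts" error deep inside model.matrix.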





Helpful information that we can get from the fitted model object



Fitted lm and glm objects contain an xlevels component, which records the factor levels actually used for model fitting.



fit$xlevels
#$f
#[1] "a" "b" "c"


So in case you have not followed the recommendations listed above and have got into trouble with factor levels, this xlevels should be the first thing to inspect.
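For instance, comparing the levels present in the test data against fit$xlevels pinpoints the offending values before predict has a chance to error (reusing the toy train/test from the walkthrough):

```r
# Diagnose "new levels" up front: for every factor the model knows about,
# list the levels in the test data that the model was not fitted with.
train <- na.omit(data.frame(y = runif(4), x = c(runif(3), NA),
                            f = factor(letters[1:4])))
test  <- data.frame(y = runif(4), x = runif(4), f = factor(letters[1:4]))
fit   <- lm(y ~ x + f, data = train)

new_levels <- sapply(names(fit$xlevels), function(v)
  setdiff(levels(factor(test[[v]])), fit$xlevels[[v]]),
  simplify = FALSE)
new_levels
# $f
# [1] "d"   (the culprit behind the predict error)
```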



If you want something like table to count how many cases there are for each factor level, here is one way: Get number of data in each factor level (as well as interaction) from a fitted lm or glm [R]. Note, though, that building a model matrix this way can use a lot of RAM.





OK, I see what the problem is now, but how to make predict work?



If you cannot work with a different train/test split (see the next section), you need to set the factor levels that appear in test but not in xlevels to NA. Then predict will simply return NA for such incomplete cases.
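A sketch of that approach, reusing the toy example from the walkthrough (the factor() call re-codes each factor with exactly the fitted levels, so anything else becomes NA):

```r
# Recode factors in the test data to the fitted levels; unseen values become
# NA, and predict() then returns NA for those rows instead of erroring.
train <- na.omit(data.frame(y = runif(4), x = c(runif(3), NA),
                            f = factor(letters[1:4])))
test  <- data.frame(y = runif(4), x = runif(4), f = factor(letters[1:4]))
fit   <- lm(y ~ x + f, data = train)

for (v in names(fit$xlevels)) {
  test[[v]] <- factor(as.character(test[[v]]), levels = fit$xlevels[[v]])
}

predict(fit, newdata = test)
# no "new levels" error; the row with f == "d" predicts NA
# (this toy fit is rank-deficient, so predict also emits a warning)
```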





Is there a better way to avoid such problem at all?



People split data into train and test because they want to do cross-validation. The first step is to apply na.omit to your full dataset to get rid of NA noise. Then we could randomly partition what is left, but this naive approach may end up with




  • some factor levels in test but not in train (oops, we get "new factor level" error when using predict);

  • some factor variables in train having only one level after unused levels are removed (oops, we get the "contrasts" error when using lm and glm).


So, it is highly recommended that you do some more sophisticated partitioning like stratified sampling.
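A minimal sketch of such a split, stratifying on one factor (the 80/20 proportion, seed, and variable names are arbitrary):

```r
# Stratified 80/20 split: sample row indices within each factor level so
# every level is represented in train. (A level with a single row would still
# end up train-only, so genuinely rare levels need special handling.)
set.seed(1)
full <- na.omit(data.frame(y = runif(100),
                           f = factor(sample(letters[1:4], 100, replace = TRUE))))

train_idx <- unlist(lapply(split(seq_len(nrow(full)), full$f),
                           function(i) sample(i, ceiling(0.8 * length(i)))))
train <- full[train_idx, , drop = FALSE]
test  <- full[-train_idx, , drop = FALSE]

setdiff(unique(as.character(test$f)), unique(as.character(train$f)))
# character(0): no level appears only in test
```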



There is in fact another hazard, but not causing programming errors:




  • the model matrix for train is rank-deficient (oops, we get a "prediction for rank-deficient model may be misleading" warning when using predict).


Regarding rank-deficiency in model fitting, see lme4::lmer reports "fixed-effect model matrix is rank deficient", do I need a fix and how to? Rank-deficiency does not cause problems for model estimation and checking, but it can be a hazard for prediction: R lm, Could anyone give me an example of the misleading case on "prediction from a rank-deficient"? Such issues are, however, more difficult to avoid, particularly if you have many factors, possibly with interactions.



  • Or filling NAs in the training dataset with multiple imputation might be another way to avoid the problem.
    – Roland
    Jul 30 at 8:30











Your Answer






StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














 

draft saved


draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f51552203%2fr-how-to-debug-factor-has-new-levels-error-for-linear-model-and-prediction%23new-answer', 'question_page');
}
);

Post as a guest
































1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes








up vote
5
down vote













Table of Contents:




  • A simple example for walkthrough

  • Suggestion for users

  • Helpful information that we can get from the fitted model object

  • OK, I see what the problem is now, but how to make predict work?

  • Is there a better way to avoid such problem at all?




A simple example for walkthrough



Here is simple enough reproducible example to hint you what has happened.



train <- data.frame(y = runif(4), x = c(runif(3), NA), f = factor(letters[1:4]))
test <- data.frame(y = runif(4), x = runif(4), f = factor(letters[1:4]))
fit <- lm(y ~ x + f, data = train)
predict(fit, newdata = test)
#Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
# factor f has new levels d


I am fitting a model with more parameters than data so the model is rank-deficient (to be explained in the end). However, this does not affect how lm and predict work.



If you just check table(train$f) and table(test$f) it is not useful as the problem is not caused by variable f but by NA in x. lm and glm drop incomplete cases, i.e., rows with at least one NA (see ?complete.cases) for model fitting. They have to do so as otherwise the underlying FORTRAN routine for QR factorization would fail because it can not handle NA. If you check the documentation at ?lm you will see this function has an argument na.action which defaults to na.omit. You can also set it to na.exclude but na.pass which retains NA will cause FORTRAN error:



fit <- lm(y ~ x + f, data = train, na.action = na.pass)
#Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
# NA/NaN/Inf in 'x'


Let's remove NA from the training dataset.



train <- na.omit(train)
train$f
#[1] a b c
#Levels: a b c d


f now has an unused level "d". lm and glm will drop unused levels when building the model frame (and later the model matrix):



## source code of lm; don't run
mf$drop.unused.levels <- TRUE
mf[[1L]] <- quote(stats::model.frame)
mf <- eval(mf, parent.frame())


This is not user controllable. The reason is that if an unused level is included, it will generate a column of zeros in the model matrix.



mf <- model.frame(y ~ x + f, data = train, drop.unused.levels = FALSE)
model.matrix(y ~ x + f, data = mf)
# (Intercept) x fb fc fd
#1 1 0.90021178 0 0 0
#2 1 0.10188534 1 0 0
#3 1 0.05881954 0 1 0
#attr(,"assign")
#[1] 0 1 2 2 2
#attr(,"contrasts")
#attr(,"contrasts")$f
#[1] "contr.treatment"


This is undesired as it produces NA coefficient for dummy variable fd. By drop.unused.levels = TRUE as forced by lm and glm:



mf <- model.frame(y ~ x + f, data = train, drop.unused.levels = TRUE)
model.matrix(y ~ x + f, data = mf)
# (Intercept) x fb fc
#1 1 0.90021178 0 0
#2 1 0.10188534 1 0
#3 1 0.05881954 0 1
#attr(,"assign")
#[1] 0 1 2 2
#attr(,"contrasts")
#attr(,"contrasts")$f
#[1] "contr.treatment"


The fd is gone, and



mf$f
#[1] a b c
#Levels: a b c


The now non-existing "d" level will cause the "new factor level" error in predict.





Suggestion for users



It is highly recommended that all users do the following manually when fitting models:





  • [No. 1] remove incomplete cases;


  • [No. 2] drop unused factor levels.


This is exactly the procedure as recommended here: How to debug "contrasts can be applied only to factors with 2 or more levels" error? This gets users aware of what lm and glm do under the hood, and makes their debugging life much easier.



Note, there should be another recommendation in the list:





  • [No. 0] do subsetting yourself


Users may occasionally use subset argument. But there is a potential pitfall: not all factor levels might appear in the subsetted dataset, thus you may get "new factor levels" when using predict later.



The above advice is particularly important when you write functions wrapping lm or glm. You want your functions to be robust. Ask your function to return an informative error rather than waiting for lm and glm to complain.





Helpful information that we can get from the fitted model object



lm and glm return an xlevels value in the fitted object. It contains the factor levels actually used for model fitting.



fit$xlevels
#$f
#[1] "a" "b" "c"


So in case you have not followed the recommendations listed above and have got into trouble with factor levels, this xlevels should be the first thing to inspect.



If you want to use something like table to count how many cases there are for each factor levels, here is a way: Get number of data in each factor level (as well as interaction) from a fitted lm or glm [R], although making a model matrix can use much RAM.





OK, I see what the problem is now, but how to make predict work?



If you can not choose to work with a different set of train and test dataset (see the next section), you need to set those factor levels in the test but not in xlevels to NA. Then predict will just predict NA for such incomplete cases.





Is there a better way to avoid such problem at all?



People split data into train and test as they want to do cross-validation. The first step is to apply na.omit on your full dataset to get rid of NA noise. Then we could do a random partitioning on what is left, but this this naive way may end up with




  • some factor levels in test but not in train (oops, we get "new factor level" error when using predict);

  • some factor variables in train only have 1 level after unused levels removed (oops, we get "contrasts" error when using lm and glm);


So, it is highly recommended that you do some more sophisticated partitioning like stratified sampling.



There is in fact another hazard, but not causing programming errors:




  • the model matrix for train is rank-deficient (oops, we get a "prediction for rank-deficient model may be misleading" warning when using predict).


Regarding the rank-deficiency in model fitting, see lme4::lmer reports "fixed-effect model matrix is rank deficient", do I need a fix and how to? Rank-deficiency does not cause problem for model estimation and checking, but can be a hazard for prediction: R lm, Could anyone give me an example of the misleading case on “prediction from a rank-deficient”? However, such issue is more difficult to avoid, particularly if you have many factors and possibly with interaction.






share|improve this answer























  • Or filling NA's in the training dataset with multiple imputation might be a different way to avoid the problem.
    – Roland
    Jul 30 at 8:30















up vote
5
down vote













Table of Contents:




  • A simple example for walkthrough

  • Suggestion for users

  • Helpful information that we can get from the fitted model object

  • OK, I see what the problem is now, but how to make predict work?

  • Is there a better way to avoid such problem at all?




A simple example for walkthrough



Here is simple enough reproducible example to hint you what has happened.



train <- data.frame(y = runif(4), x = c(runif(3), NA), f = factor(letters[1:4]))
test <- data.frame(y = runif(4), x = runif(4), f = factor(letters[1:4]))
fit <- lm(y ~ x + f, data = train)
predict(fit, newdata = test)
#Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
# factor f has new levels d


I am fitting a model with more parameters than data so the model is rank-deficient (to be explained in the end). However, this does not affect how lm and predict work.



If you just check table(train$f) and table(test$f) it is not useful as the problem is not caused by variable f but by NA in x. lm and glm drop incomplete cases, i.e., rows with at least one NA (see ?complete.cases) for model fitting. They have to do so as otherwise the underlying FORTRAN routine for QR factorization would fail because it can not handle NA. If you check the documentation at ?lm you will see this function has an argument na.action which defaults to na.omit. You can also set it to na.exclude but na.pass which retains NA will cause FORTRAN error:



fit <- lm(y ~ x + f, data = train, na.action = na.pass)
#Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
# NA/NaN/Inf in 'x'


Let's remove NA from the training dataset.



train <- na.omit(train)
train$f
#[1] a b c
#Levels: a b c d


f now has an unused level "d". lm and glm will drop unused levels when building the model frame (and later the model matrix):



## source code of lm; don't run
mf$drop.unused.levels <- TRUE
mf[[1L]] <- quote(stats::model.frame)
mf <- eval(mf, parent.frame())


This is not user controllable. The reason is that if an unused level is included, it will generate a column of zeros in the model matrix.



mf <- model.frame(y ~ x + f, data = train, drop.unused.levels = FALSE)
model.matrix(y ~ x + f, data = mf)
# (Intercept) x fb fc fd
#1 1 0.90021178 0 0 0
#2 1 0.10188534 1 0 0
#3 1 0.05881954 0 1 0
#attr(,"assign")
#[1] 0 1 2 2 2
#attr(,"contrasts")
#attr(,"contrasts")$f
#[1] "contr.treatment"


This is undesired as it produces NA coefficient for dummy variable fd. By drop.unused.levels = TRUE as forced by lm and glm:



mf <- model.frame(y ~ x + f, data = train, drop.unused.levels = TRUE)
model.matrix(y ~ x + f, data = mf)
# (Intercept) x fb fc
#1 1 0.90021178 0 0
#2 1 0.10188534 1 0
#3 1 0.05881954 0 1
#attr(,"assign")
#[1] 0 1 2 2
#attr(,"contrasts")
#attr(,"contrasts")$f
#[1] "contr.treatment"


The fd is gone, and



mf$f
#[1] a b c
#Levels: a b c


The now non-existing "d" level will cause the "new factor level" error in predict.





Suggestion for users



It is highly recommended that all users do the following manually when fitting models:





  • [No. 1] remove incomplete cases;


  • [No. 2] drop unused factor levels.


This is exactly the procedure as recommended here: How to debug "contrasts can be applied only to factors with 2 or more levels" error? This gets users aware of what lm and glm do under the hood, and makes their debugging life much easier.



Note, there should be another recommendation in the list:





  • [No. 0] do subsetting yourself


Users may occasionally use subset argument. But there is a potential pitfall: not all factor levels might appear in the subsetted dataset, thus you may get "new factor levels" when using predict later.



The above advice is particularly important when you write functions wrapping lm or glm. You want your functions to be robust. Ask your function to return an informative error rather than waiting for lm and glm to complain.





Helpful information that we can get from the fitted model object



lm and glm return an xlevels value in the fitted object. It contains the factor levels actually used for model fitting.



fit$xlevels
#$f
#[1] "a" "b" "c"


So in case you have not followed the recommendations listed above and have got into trouble with factor levels, this xlevels should be the first thing to inspect.



If you want to use something like table to count how many cases there are for each factor levels, here is a way: Get number of data in each factor level (as well as interaction) from a fitted lm or glm [R], although making a model matrix can use much RAM.





OK, I see what the problem is now, but how to make predict work?



If you can not choose to work with a different set of train and test dataset (see the next section), you need to set those factor levels in the test but not in xlevels to NA. Then predict will just predict NA for such incomplete cases.





Is there a better way to avoid such problem at all?



People split data into train and test as they want to do cross-validation. The first step is to apply na.omit on your full dataset to get rid of NA noise. Then we could do a random partitioning on what is left, but this this naive way may end up with




  • some factor levels in test but not in train (oops, we get "new factor level" error when using predict);

  • some factor variables in train only have 1 level after unused levels removed (oops, we get "contrasts" error when using lm and glm);


So, it is highly recommended that you do some more sophisticated partitioning like stratified sampling.



There is in fact another hazard, but not causing programming errors:




  • the model matrix for train is rank-deficient (oops, we get a "prediction for rank-deficient model may be misleading" warning when using predict).


Regarding the rank-deficiency in model fitting, see lme4::lmer reports "fixed-effect model matrix is rank deficient", do I need a fix and how to? Rank-deficiency does not cause problem for model estimation and checking, but can be a hazard for prediction: R lm, Could anyone give me an example of the misleading case on “prediction from a rank-deficient”? However, such issue is more difficult to avoid, particularly if you have many factors and possibly with interaction.






share|improve this answer























  • Or filling NA's in the training dataset with multiple imputation might be a different way to avoid the problem.
    – Roland
    Jul 30 at 8:30













up vote
5
down vote










up vote
5
down vote









Table of Contents:




  • A simple example for walkthrough

  • Suggestion for users

  • Helpful information that we can get from the fitted model object

  • OK, I see what the problem is now, but how to make predict work?

  • Is there a better way to avoid such problem at all?




A simple example for walkthrough



Here is simple enough reproducible example to hint you what has happened.



train <- data.frame(y = runif(4), x = c(runif(3), NA), f = factor(letters[1:4]))
test <- data.frame(y = runif(4), x = runif(4), f = factor(letters[1:4]))
fit <- lm(y ~ x + f, data = train)
predict(fit, newdata = test)
#Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
# factor f has new levels d


I am fitting a model with more parameters than data so the model is rank-deficient (to be explained in the end). However, this does not affect how lm and predict work.



If you just check table(train$f) and table(test$f) it is not useful as the problem is not caused by variable f but by NA in x. lm and glm drop incomplete cases, i.e., rows with at least one NA (see ?complete.cases) for model fitting. They have to do so as otherwise the underlying FORTRAN routine for QR factorization would fail because it can not handle NA. If you check the documentation at ?lm you will see this function has an argument na.action which defaults to na.omit. You can also set it to na.exclude but na.pass which retains NA will cause FORTRAN error:



fit <- lm(y ~ x + f, data = train, na.action = na.pass)
#Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
# NA/NaN/Inf in 'x'


Let's remove NA from the training dataset.



train <- na.omit(train)
train$f
#[1] a b c
#Levels: a b c d


f now has an unused level "d". lm and glm will drop unused levels when building the model frame (and later the model matrix):



## source code of lm; don't run
mf$drop.unused.levels <- TRUE
mf[[1L]] <- quote(stats::model.frame)
mf <- eval(mf, parent.frame())


This is not user-controllable. The reason is that an unused level, if retained, would generate a column of zeros in the model matrix.



mf <- model.frame(y ~ x + f, data = train, drop.unused.levels = FALSE)
model.matrix(y ~ x + f, data = mf)
# (Intercept) x fb fc fd
#1 1 0.90021178 0 0 0
#2 1 0.10188534 1 0 0
#3 1 0.05881954 0 1 0
#attr(,"assign")
#[1] 0 1 2 2 2
#attr(,"contrasts")
#attr(,"contrasts")$f
#[1] "contr.treatment"


This is undesirable, as it produces an NA coefficient for the dummy variable fd. With drop.unused.levels = TRUE, as forced by lm and glm:



mf <- model.frame(y ~ x + f, data = train, drop.unused.levels = TRUE)
model.matrix(y ~ x + f, data = mf)
# (Intercept) x fb fc
#1 1 0.90021178 0 0
#2 1 0.10188534 1 0
#3 1 0.05881954 0 1
#attr(,"assign")
#[1] 0 1 2 2
#attr(,"contrasts")
#attr(,"contrasts")$f
#[1] "contr.treatment"


The fd column is gone, and



mf$f
#[1] a b c
#Levels: a b c


The now non-existing "d" level will cause the "new factor level" error in predict.





Suggestion for users



It is highly recommended that all users do the following manually when fitting models:





  • [No. 1] remove incomplete cases;


  • [No. 2] drop unused factor levels.


This is exactly the procedure recommended here: How to debug "contrasts can be applied only to factors with 2 or more levels" error? It makes users aware of what lm and glm do under the hood, and makes their debugging life much easier.

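In code, the two manual steps above can be sketched as follows (a minimal sketch, assuming your data frame is called train):

```r
## [No. 1] remove incomplete cases (rows containing any NA)
train <- na.omit(train)
## [No. 2] drop unused factor levels across all factor columns
train <- droplevels(train)
```

droplevels works on a whole data frame, so you do not need to loop over factor columns yourself.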


Note, there should be another recommendation in the list:





  • [No. 0] do subsetting yourself


Users may occasionally use the subset argument. But there is a potential pitfall: not all factor levels may appear in the subsetted dataset, so you may get "new factor levels" errors when using predict later.


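A small made-up sketch of this pitfall (the data frame d and the subset condition here are invented for illustration):

```r
d <- data.frame(y = rnorm(4), f = factor(letters[1:4]))
## lm applies `subset` first, then drops the now-unused level "d"
fit <- lm(y ~ f, data = d, subset = f != "d")
fit$xlevels$f
#[1] "a" "b" "c"
```

Predicting on rows with f == "d" will now raise the "new factor levels" error, even though "d" was in the original data.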

The above advice is particularly important when you write functions wrapping lm or glm. You want your functions to be robust, so have them return an informative error rather than waiting for lm and glm to complain.





Helpful information that we can get from the fitted model object



lm and glm record an xlevels component in the fitted object. It contains the factor levels actually used for model fitting.



fit$xlevels
#$f
#[1] "a" "b" "c"


So if you have not followed the recommendations listed above and have run into trouble with factor levels, this xlevels should be the first thing to inspect.

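For instance, with the fit and test objects from the walkthrough above, you can quickly list the levels the model has never seen:

```r
## levels present in test$f but absent from the levels used in fitting
setdiff(levels(test$f), fit$xlevels$f)
#[1] "d"
```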


If you want to use something like table to count how many cases there are for each factor level, here is a way: Get number of data in each factor level (as well as interaction) from a fitted lm or glm [R], although building a model matrix can use a lot of RAM.





OK, I see what the problem is now, but how to make predict work?



If you cannot choose to work with a different pair of train and test datasets (see the next section), you need to set those factor levels that are in the test data but not in xlevels to NA. predict will then simply return NA for such incomplete cases.

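Using the walkthrough example, this can be sketched as:

```r
## map factor values whose level is unseen by the model to NA
test$f[!(test$f %in% fit$xlevels$f)] <- NA
## predict.lm uses na.pass on newdata by default,
## so rows with unseen levels simply yield NA predictions
predict(fit, newdata = test)
```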




Is there a better way to avoid such problem at all?



People split data into train and test sets because they want to do cross-validation. The first step is to apply na.omit to your full dataset to get rid of the NA noise. Then we could randomly partition what is left, but this naive approach may end up with:




  • some factor levels in test but not in train (oops, we get "new factor level" error when using predict);

  • some factor variables in train only have 1 level after unused levels removed (oops, we get "contrasts" error when using lm and glm);


So it is highly recommended that you use some more sophisticated partitioning, such as stratified sampling.

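A base-R sketch of stratified partitioning by a factor f (the data frame dat and the 80/20 split ratio are assumptions for illustration):

```r
set.seed(1)
## sample ~80% within each level of dat$f so every level appears in train;
## indexing i[sample.int(length(i), ...)] avoids the sample() scalar trap
idx <- unlist(lapply(split(seq_len(nrow(dat)), dat$f), function(i)
  i[sample.int(length(i), ceiling(0.8 * length(i)))]))
train <- dat[idx, , drop = FALSE]
test  <- dat[-idx, , drop = FALSE]
```

Packages like caret (createDataPartition) offer the same idea ready-made, but the sketch above has no dependencies.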


There is in fact another hazard, although it does not cause programming errors:




  • the model matrix for train is rank-deficient (oops, we get a "prediction for rank-deficient model may be misleading" warning when using predict).


Regarding rank-deficiency in model fitting, see lme4::lmer reports "fixed-effect model matrix is rank deficient", do I need a fix and how to? Rank-deficiency does not cause problems for model estimation and checking, but can be a hazard for prediction: R lm, Could anyone give me an example of the misleading case on "prediction from a rank-deficient"? However, such issues are more difficult to avoid, particularly if you have many factors, possibly with interactions.






edited Jul 27 at 15:25

























answered Jul 27 at 10:25









李哲源
  • Or filling NA's in the training dataset with multiple imputation might be a different way to avoid the problem.
    – Roland
    Jul 30 at 8:30