
Question

Let's suppose we have the following vector:

v <- c(2,2,3,5,8,0,32,1,3,12,5,2,3,5,8,33,1)

Given a sequence of numbers, for instance c(2,3,5,8), I am trying to find the position of this sequence within the vector v. The result I expect is something like:

FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE

I am trying to use which(v == c(2,3,5,8)) but it doesn't give me what I am looking for.

Thanks in advance.

@akrun I want to find the exact beginning and end of the sequence. The first element is 2, followed by 3, and so on... Does it clarify? — Feb 7 '18 at 9:54
Keep in mind this will not be possible with floats due to the usual binary representation limits. You might be able to modify any of the given solutions, replacing == with all.equal or cgwtools::approxeq (tooting my own horn there) — Feb 7 '18 at 14:34
Seems like you just need a string search algorithm: en.wikipedia.org/wiki/String_searching_algorithm — Feb 8 '18 at 4:20
@Alexander A string search algorithm isn't the most efficient solution in this case. See the benchmarks for examples. — Feb 10 '18 at 7:08
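
The floating-point caveat raised in the comments above can be illustrated with a small base R sketch. The helper name find_seq_approx and its tol argument are made up for this illustration (they do not come from any package), and the default tolerance is an assumption you would tune for your own data.

```r
# Exact comparison fails for values that differ only by rounding error
(0.1 + 0.2) == 0.3                    # FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))     # TRUE

# Hypothetical helper: start positions of x in v, compared within a tolerance
find_seq_approx <- function(v, x, tol = 1e-8) {
  n <- length(x)
  if (n == 0L || length(v) < n) return(integer(0))
  # every position where a full window of length n still fits
  starts <- seq_len(length(v) - n + 1L)
  # drop candidates whose i-th element is not within tol of x[i]
  for (i in seq_len(n)) {
    starts <- starts[abs(v[starts + i - 1L] - x[i]) <= tol]
  }
  starts
}

v_num <- c(0.1 + 0.2, 0.5, 1, 0.3, 0.5)
find_seq_approx(v_num, c(0.3, 0.5))
# [1] 1 4
```

With plain == the first window would be missed, because 0.1 + 0.2 is not stored as exactly 0.3.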

score 21 · Accepted Answer · 2018-02-07 10:08:59Z

Using base R you could do the following:

v <- c(2,2,3,5,8,0,32,1,3,12,5,2,3,5,8,33,1)

x <- c(2,3,5,8)



idx <- which(v == x[1])

idx[sapply(idx, function(i) all(v[i:(i+(length(x)-1))] == x))]

# [1]  2 12

This tells you that the exact sequence appears twice, starting at positions 2 and 12 of your vector v.

It first finds the possible starting positions, i.e. where v equals the first value of x, and then loops through these positions to check whether the values that follow also equal the remaining values of x.
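
As a rough sketch, the two steps described above could be wrapped into a function that also returns the TRUE/FALSE vector asked for in the question. The name find_seq is made up for this illustration. The extra filter on idx is an addition that is not part of the answer: without it, a candidate too close to the end of v produces NA in the comparison. The sketch assumes v contains no NA values.

```r
# Hypothetical helper, illustrative name and signature
find_seq <- function(v, x) {
  n <- length(x)
  # step 1: candidate starts, i.e. where v equals the first value of x
  idx <- which(v == x[1])
  # addition: keep only candidates where a full window of length n still fits
  idx <- idx[idx <= length(v) - n + 1]
  # step 2: keep candidates whose whole window equals x
  keep <- vapply(idx, function(i) all(v[i:(i + n - 1)] == x), logical(1))
  idx[keep]
}

v <- c(2,2,3,5,8,0,32,1,3,12,5,2,3,5,8,33,1)
x <- c(2,3,5,8)

starts <- find_seq(v, x)
starts
# [1]  2 12

# logical mask in the form shown in the question
mask <- seq_along(v) %in% (rep(starts, each = length(x)) + 0:(length(x) - 1))
which(mask)
# [1]  2  3  4  5 12 13 14 15
```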

edited Feb 7 '18 at 10:08

answered Feb 7 '18 at 10:05

docendo discimus


I was going to suggest something like which(colSums(t(embed(v, length(x))[, length(x):1]) == x) == length(x)), but I think this is easy to follow....

– A5C1D2H2I1M1N2O1R2T1
Feb 7 '18 at 10:06

@A5C1D2H2I1M1N2O1R2T1, that looks indeed a little hard to follow

– docendo discimus
Feb 7 '18 at 10:10

2

Worth noting, idx <- which(v == x[1]) is an important step. While other answers are going through all 1:4 shift variations 14 times, this answer does it in 3 steps.

– zx8754
Feb 7 '18 at 10:51

@zx8754, ...but the "data.table" approach still manages to win in terms of speed in a couple of tests I did with larger vecs....

– A5C1D2H2I1M1N2O1R2T1
Feb 7 '18 at 10:52

score 16 · Accepted Answer · 2018-02-09 15:02:38Z

Two other approaches using the shift function from data.table:

library(data.table)



# option 1

which(rowSums(mapply('==',

                     shift(v, type = 'lead', n = 0:(length(x) - 1)),

                     x)

              ) == length(x))



# option 2

which(Reduce("+", Map('==',

                      shift(v, type = 'lead', n = 0:(length(x) - 1)),

                      x)

             ) == length(x))

both give:

[1]  2 12

To get a full vector of the matching positions:

l <- length(x)

w <- which(Reduce("+", Map('==',

                           shift(v, type = 'lead', n = 0:(l - 1)),

                           x)

                  ) == l)

rep(w, each = l) + 0:(l-1)

which gives:

[1]  2  3  4  5 12 13 14 15
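
One thing to keep in mind with rep(w, each = l) + 0:(l-1): when matches overlap, the same position is listed more than once. A small base R sketch of that case follows; starts_of is a made-up stand-in for the shift()-based search so that the example runs without data.table, and it assumes v contains no NA values.

```r
# Hypothetical base-R stand-in that returns the start positions of x in v
starts_of <- function(v, x) {
  w <- seq_len(max(length(v) - length(x) + 1L, 0L))
  for (i in seq_along(x)) w <- w[v[w + i - 1L] == x[i]]
  w
}

v1 <- c(1, 1, 1, 1, 0)
x1 <- c(1, 1, 1)
l  <- length(x1)

w <- starts_of(v1, x1)
w
# [1] 1 2

pos <- rep(w, each = l) + 0:(l - 1)
pos
# [1] 1 2 3 2 3 4

unique(pos)   # each matched position once
# [1] 1 2 3 4
```

Whether duplicates matter depends on what the positions are used for; unique() is cheap if they do.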

The benchmark which was included earlier in this answer has been moved to a separate community wiki answer.

Data used:

v <- c(2,2,3,5,8,0,32,1,3,12,5,2,3,5,8,33,1)

x <- c(2,3,5,8)

edited Feb 9 '18 at 15:02

answered Feb 7 '18 at 10:24

Jaap


1

Many of these solutions don't give the desired output, the extra step is not cost free

– Moody_Mudskipper
Feb 8 '18 at 11:29

4

@989 I will update, but didn't have the time yet. What I do not understand is that you downvote me but don't downvote the invalid answers. What's the reason for that? Furthermore: why didn't you comment under the invalid answers so that they get a chance to improve?

– Jaap
Feb 9 '18 at 10:46

2

@989 You could always suggest Edit to this post or provide your own benchmark on your own post with explanation why Jaap's is wrong. No need for this kind of tone.

– zx8754
Feb 9 '18 at 10:57

1

Not sure whose idea it was, but getting a full vector of matching positions concatenated together seems like a bad idea. If I test with x = c(1,1,1) then I may find positions appearing multiple times. Besides, it's redundant -- the informational content of the first position is enough... Anyway, not a big deal, just my two cents ... noticed it all over the benchmarks.

– Frank
Feb 9 '18 at 22:11

1

@Frank I don't think it is a bad idea necessarily. It depends on what you want to do with it. I included it in the benchmarks to make sure every solution returns the same and thus get a fair comparison.

– Jaap
Feb 10 '18 at 7:04

score 15 · Accepted Answer · 2018-02-10 09:08:52Z

You can use rollapply() from zoo

v <- c(2,2,3,5,8,0,32,1,3,12,5,2,3,5,8,33,1)

x <- c(2,3,5,8)



library("zoo")

searchX <- function(x, X) all(x==X)

rollapply(v, FUN=searchX, X=x, width=length(x))

A TRUE in the result marks the beginning of the sequence.

The code could be simplified to rollapply(v, length(x), identical, x) (thanks to G. Grothendieck):

set.seed(2)

vl <- as.numeric(sample(1:10, 1e6, TRUE))

# vm <- vl[1:1e5]

# vs <- vl[1:1e4]

x <- c(2,3,5)



library("zoo")

searchX <- function(x, X) all(x==X)

i1 <- rollapply(vl, FUN=searchX, X=x, width=length(x))

i2 <- rollapply(vl, width=length(x), identical, y=x)



identical(i1, i2)

To use identical(), both arguments must be of the same type (num and int are not the same).

If needed, == coerces int to num; identical() does not do any coercion.
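
A minimal base R illustration of the type issue described above (the variable names are only for this example):

```r
v_int <- c(2L, 3L, 5L, 8L)   # integer
x_dbl <- c(2, 3, 5, 8)       # double

typeof(v_int)                         # "integer"
typeof(x_dbl)                         # "double"

all(v_int == x_dbl)                   # TRUE:  == coerces integer to double
identical(v_int, x_dbl)               # FALSE: types differ, no coercion
identical(as.numeric(v_int), x_dbl)   # TRUE once both are double
```

So before using the identical() variant, converting one side with as.numeric() (or as.integer(), if all values are whole numbers) makes the comparison behave like ==.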

edited Feb 10 '18 at 9:08

answered Feb 7 '18 at 10:03

jogo


Could you check your 2nd solution? As you can see in the benchmark answer it doesn't return the same output as the other answers.

– Jaap
Feb 9 '18 at 15:04

1

I tried (also unsuccessfully) to repair it as well. I will remove it from the benchmarks.

– Jaap
Feb 9 '18 at 16:30

1

The code could be simplified to rollapply(v, length(x), identical, x) where v and x must be of the same type, e.g. both integer or both double, since for example identical(5L, 5) is FALSE.

– G. Grothendieck
Feb 10 '18 at 3:18

1

@G.Grothendieck Thx, that was indeed the issue. When both are of the same type, the solution with identical works.

– Jaap
Feb 10 '18 at 7:16


score 10 · Accepted Answer · 2018-02-07 21:38:19Z

I feel like looping should be efficient:

w = seq_along(v)

for (i in seq_along(x)) w = w[v[w+i-1L] == x[i]]



w 

# [1]  2 12

This should be writable in C++ following @SymbolixAU's approach for extra speed.
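
To make the loop above easier to follow, here is a sketch that prints the surviving candidate start positions after each element of x has been checked. One detail differs from the code above and is my own assumption about what is wanted: w starts from only those positions where a full window fits, so that v[w+i-1L] never reads past the end of v (which would put NA into the result).

```r
v <- c(2,2,3,5,8,0,32,1,3,12,5,2,3,5,8,33,1)
x <- c(2,3,5,8)

# candidate starts: only positions where all of x still fits inside v
w <- seq_len(length(v) - length(x) + 1L)

for (i in seq_along(x)) {
  # keep a candidate only if its i-th element matches x[i]
  w <- w[v[w + i - 1L] == x[i]]
  cat("after checking element", i, "candidates are:", w, "\n")
}
# candidates shrink as follows:
#   i = 1:  1 2 12
#   i = 2:  2 12
#   i = 3:  2 12
#   i = 4:  2 12

w
# [1]  2 12
```

Most candidates are discarded after the first pass, which is why the later passes touch very few elements.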

A basic comparison:

# create functions for selected approaches

redjaap <- function(v,x)

  which(Reduce("+", Map('==', shift(v, type = 'lead', n = 0:(length(x) - 1)), x)) == length(x))

loop <- function(v,x){

  w = seq_along(v)

  for (i in seq_along(x)) w = w[v[w+i-1L] == x[i]]

  w

}



# check consistency

identical(redjaap(v,x), loop(v,x))

# [1] TRUE



# check speed

library(microbenchmark)

vv <- rep(v, 1e4)

microbenchmark(redjaap(vv,x), loop(vv,x), times = 100)

# Unit: milliseconds

#            expr      min       lq      mean   median       uq       max neval cld

#  redjaap(vv, x) 5.883809 8.058230 17.225899 9.080246 9.907514  96.35226   100   b

#     loop(vv, x) 3.629213 5.080816  9.475016 5.578508 6.495105 112.61242   100  a 



# check consistency again

identical(redjaap(vv,x), loop(vv,x))

# [1] TRUE

this method is really efficient in terms of the amount of code to achieve the objective...can use compiler::cmpfun(frank) for a slight speedup — Feb 22 '18 at 8:03

score 10 · Accepted Answer · 2018-07-16 22:23:19Z

Here are two Rcpp solutions. The first one returns a 0/1 vector that marks the positions in v where the sequence starts.

library(Rcpp)



v <- c(2,2,3,5,8,0,32,1,3,12,5,2,3,5,8,33,1)

x <- c(2,3,5,8)



cppFunction('NumericVector SeqInVec(NumericVector myVector, NumericVector mySequence) {



    int vecSize = myVector.size();

    int seqSize = mySequence.size();

    NumericVector comparison(seqSize);

    NumericVector res(vecSize);



    for (int i = 0; i + seqSize <= vecSize; i++ ) {  // only start where a full window fits in myVector



        for (int j = 0; j < seqSize; j++ ) {

                comparison[j] = mySequence[j] == myVector[i + j];

        }



        if (sum(comparison) == seqSize) {

            res[i] = 1;

        }else{

            res[i] = 0;

        }

    }



    return res;



    }')



SeqInVec(v, x)

#[1] 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0

This second one returns the index values (as per the other answers) of every matched entry in the sequence.

cppFunction('NumericVector SeqInVec(NumericVector myVector, NumericVector mySequence) {



  int vecSize = myVector.size();

  int seqSize = mySequence.size();

  NumericVector comparison(seqSize);

  NumericVector res(vecSize);

  int foundCounter = 0;



  for (int i = 0; i + seqSize <= vecSize; i++ ) {  // only start where a full window fits in myVector



    for (int j = 0; j < seqSize; j++ ) {

      comparison[j] = mySequence[j] == myVector[i + j];

    }



    if (sum(comparison) == seqSize) {

      for (int j = 0; j < seqSize; j++ ) {

        res[foundCounter] = i + j + 1;

        foundCounter++;

      }

    }

  }



  if (foundCounter == 0) return NumericVector(0);  // no match: return an empty vector

  IntegerVector idx = seq(0, (foundCounter-1));

  return res[idx];

}')



SeqInVec(v, x)

# [1]  2  3  4  5 12 13 14 15

Optimising

As @MichaelChirico points out in their comment, further optimisations can be made. For example, if the first entry in the sequence doesn't match the current value in the vector, we don't need to do the rest of the comparison:

cppFunction('NumericVector SeqInVecOpt(NumericVector myVector, NumericVector mySequence) {



  int vecSize = myVector.size();

  int seqSize = mySequence.size();

  NumericVector comparison(seqSize);

  NumericVector res(vecSize);

  int foundCounter = 0;



  for (int i = 0; i + seqSize <= vecSize; i++ ) {  // only start where a full window fits in myVector



    if (myVector[i] == mySequence[0]) {

        for (int j = 0; j < seqSize; j++ ) {

          comparison[j] = mySequence[j] == myVector[i + j];

        }



        if (sum(comparison) == seqSize) {

          for (int j = 0; j < seqSize; j++ ) {

            res[foundCounter] = i + j + 1;

            foundCounter++;

          }

        }

    }

  }



  if (foundCounter == 0) return NumericVector(0);  // no match: return an empty vector

  IntegerVector idx = seq(0, (foundCounter-1));

  return res[idx];

}')

The answer with benchmarks shows the performance of these approaches

Could you update your solution such that it returns the same output as the others? I can then include it in the benchmark. — Feb 8 '18 at 20:14
since you're examining subsequent elements, shouldn't there be a way to optimize by skipping elements we already know don't start the sequence? e.g. in OPs example when checking at the second 2 we already know the 3rd element is not 2 so we can skip checking the elements after 3 — Feb 11 '18 at 11:18
2-3x speed-up, nice! I guess the improvement depends on the length of the "search" string and its density (% of TRUE values). — Feb 12 '18 at 0:09
@MichaelChirico - yes that will likely be a factor. I've also tested a variation where it will increment i by the size of the search string, rather than one each time. In this example I didn't see any improvement, however. — Feb 12 '18 at 0:13

score 8 · Accepted Answer · 2018-07-16 22:22:42Z

A benchmark on the posted answers:

Load the needed packages:

library(data.table)

library(microbenchmark)

library(Rcpp)

library(zoo)

Create the vectors on which the benchmarks will be run:

set.seed(2)

vl <- sample(1:10, 1e6, TRUE)

vm <- vl[1:1e5]

vs <- vl[1:1e4]

x <- c(2,3,5)

Test whether all solutions give the same result on the small vector vs:

> all.equal(jaap1(vs,x), jaap2(vs,x))

[1] TRUE

> all.equal(jaap1(vs,x), docendo(vs,x))

[1] TRUE

> all.equal(jaap1(vs,x), a5c1(vs,x))

[1] TRUE

> all.equal(jaap1(vs,x), jogo1(vs,x))

[1] TRUE

> all.equal(jaap1(vs,x), moody(vs,x))

[1] "Numeric: lengths (24, 873) differ"

> all.equal(jaap1(vs,x), cata1(vs,x))

[1] "Numeric: lengths (24, 0) differ"

> all.equal(jaap1(vs,x), u989(vs,x))

[1] TRUE

> all.equal(jaap1(vs,x), frank(vs,x))

[1] TRUE

> all.equal(jaap1(vs,x), symb(vs,x))

[1] TRUE

> all.equal(jaap1(vs, x), symbOpt(vs, x))

[1] TRUE

Further inspection of the cata1 and moody solutions shows that they don't give the desired output. They are therefore not included in the benchmarks.
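
One plausible explanation for the empty cata1 result (this is inferred from the code, not stated in the answers): vs comes from sample(1:10, ...) and is therefore integer, x is double, and identical() does not coerce. Independently of that, seq(length(v)-l)-1 never reaches the last possible window. A small deterministic sketch, using cata1 as defined in this benchmark and made-up test vectors:

```r
# cata1 as defined in this benchmark
cata1 <- function(v, x) {
  l <- length(x)
  w <- which(sapply(lapply(seq(length(v) - l) - 1, function(i) v[seq(x) + i]), identical, x))
  rep(w, each = l) + 0:(l - 1)
}

x     <- c(2, 3, 5)               # double
v_int <- c(1L, 2L, 3L, 5L, 4L)    # integer, like the output of sample(1:10, ...)

res_int <- cata1(v_int, x)               # integer windows are never identical() to double x
res_dbl <- cata1(as.numeric(v_int), x)   # same values stored as double
res_int
# integer(0)
res_dbl
# [1] 2 3 4

# a match in the very last window is missed: that window is never examined
res_last <- cata1(c(1, 2, 3, 5), x)
res_last
# integer(0)
```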

The benchmark for the smallest vector vs:

mbs <- microbenchmark(jaap1(vs,x), jaap2(vs,x), docendo(vs,x), a5c1(vs,x),

                      jogo1(vs,x), u989(vs,x), frank(vs,x), symb(vs,x), symbOpt(vs, x),

                      times = 100)

gives:

 print(mbs, order = "median")



 Unit: microseconds

          expr       min         lq        mean     median         uq        max neval

  symbOpt(vs, x)    40.658    47.0565    78.47119    51.5220    56.2765   2170.708   100

     symb(vs, x)   106.208   112.7885   151.76398   117.0655   123.7450   1976.360   100

    frank(vs, x)   121.303   129.0515   203.13616   132.1115   137.9370   6193.837   100

    jaap2(vs, x)   187.973   218.7805   322.98300   235.0535   255.2275   6287.548   100

    jaap1(vs, x)   306.944   341.4055   452.32426   358.2600   387.7105   6376.805   100

     a5c1(vs, x)   463.721   500.9465   628.13475   516.2845   553.2765   6179.304   100

  docendo(vs, x)  1139.689  1244.0555  1399.88150  1313.6295  1363.3480   9516.529   100

     u989(vs, x)  8048.969  8244.9570  8735.97523  8627.8335  8858.7075  18732.750   100

    jogo1(vs, x) 40022.406 42208.4870 44927.58872 43733.8935 45008.0360 124496.190   100

The benchmark for the medium vector vm:

mbm <- microbenchmark(jaap1(vm,x), jaap2(vm,x), docendo(vm,x), a5c1(vm,x),

                      jogo1(vm,x), u989(vm,x), frank(vm,x), symb(vm,x), symbOpt(vm, x),

                      times = 100)

gives:

print(mbm, order = "median")



Unit: microseconds

           expr        min          lq        mean      median         uq        max neval

 symbOpt(vm, x)    357.452    405.0415    974.9058    763.0205   1067.803   7444.126   100

    symb(vm, x)   1032.915   1117.7585   1923.4040   1422.1930   1753.044  17498.132   100

   frank(vm, x)   1158.744   1470.8170   1829.8024   1826.1330   1935.641   6423.966   100

   jaap2(vm, x)   1622.183   2872.7725   3798.6536   3147.7895   3680.954  14886.765   100

   jaap1(vm, x)   3053.024   4729.6115   7325.3753   5607.8395   6682.814  87151.774   100

    a5c1(vm, x)   5487.547   7458.2025   9612.5545   8137.1255   9420.684  88798.914   100

 docendo(vm, x)  10780.920  11357.7440  13313.6269  12029.1720  13411.026  21984.294   100

    u989(vm, x)  83518.898  84999.6890  88537.9931  87675.3260  90636.674 105681.313   100

   jogo1(vm, x) 471753.735 512979.3840 537232.7003 534780.8050 556866.124 646810.092   100

The benchmark for the largest vector vl:

mbl <- microbenchmark(jaap1(vl,x), jaap2(vl,x), docendo(vl,x), a5c1(vl,x),

                      jogo1(vl,x), u989(vl,x), frank(vl,x), symb(vl,x), symbOpt(vl, x),

                      times = 100)

gives:

  print(mbl, order = "median")



Unit: milliseconds

           expr         min          lq       mean     median         uq       max neval

 symbOpt(vl, x)    4.679646    5.768531   12.30079    6.67608   11.67082  118.3467   100

    symb(vl, x)   11.356392   12.656124   21.27423   13.74856   18.66955  149.9840   100

   frank(vl, x)   13.523963   14.929656   22.70959   17.53589   22.04182  132.6248   100

   jaap2(vl, x)   18.754847   24.968511   37.89915   29.78309   36.47700  145.3471   100

   jaap1(vl, x)   37.047549   52.500684   95.28392   72.89496  138.55008  234.8694   100

    a5c1(vl, x)   54.563389   76.704769  116.89269   89.53974  167.19679  248.9265   100

 docendo(vl, x)  109.824281  124.631557  156.60513  129.64958  145.47547  296.0214   100

    u989(vl, x) 1380.886338 1413.878029 1454.50502 1436.18430 1479.18934 1632.3281   100

   jogo1(vl, x) 4067.106897 4339.005951 4472.46318 4454.89297 4563.08310 5114.4626   100

The functions used for each solution:

jaap1 <- function(v,x) {

  l <- length(x);

  w <- which(rowSums(mapply('==', shift(v, type = 'lead', n = 0:(length(x) - 1)), x) ) == length(x));

  rep(w, each = l) + 0:(l-1)

}



jaap2 <- function(v,x) {

  l <- length(x);

  w <- which(Reduce("+", Map('==', shift(v, type = 'lead', n = 0:(length(x) - 1)), x)) == length(x));

  rep(w, each = l) + 0:(l-1)

}



docendo <- function(v,x) {

  l <- length(x);

  idx <- which(v == x[1]);

  w <- idx[sapply(idx, function(i) all(v[i:(i+(length(x)-1))] == x))];

  rep(w, each = l) + 0:(l-1)

}



a5c1 <- function(v,x) {

  l <- length(x);

  w <- which(colSums(t(embed(v, l)[, l:1]) == x) == l);

  rep(w, each = l) + 0:(l-1)

}



jogo1 <- function(v,x) {

  l <- length(x);

  searchX <- function(x, X) all(x==X);

  w <- which(rollapply(v, FUN=searchX, X=x, width=l));

  rep(w, each = l) + 0:(l-1)

}



moody <- function(v,x) {

  l <- length(x);

  v2 <- as.numeric(factor(c(v,NA),levels = x));

  v2[is.na(v2)] <- l+1;

  which(diff(v2) == 1)

}



cata1 <- function(v,x) {

  l <- length(x);

  w <- which(sapply(lapply(seq(length(v)-l)-1, function(i) v[seq(x)+i]), identical, x));

  rep(w, each = l) + 0:(l-1)

}



u989 <- function(v,x) {

  l <- length(x);

  s <- paste(v, collapse = '-');

  p <- paste0('\\b', paste(x, collapse = '-'), '\\b');

  i <- c(1, unlist(gregexpr(p, s)));

  m <- substring(s, head(i,-1), tail(i,-1));

  ln <- lengths(strsplit(m, '-'));

  w <- cumsum(c(ln[1], ln[-1]-1));

  rep(w, each = l) + 0:(l-1)

}



frank <- function(v,x) {

  l <- length(x);

  w = seq_along(v);

  for (i in seq_along(x)) w = w[v[w+i-1L] == x[i]];

  rep(w, each = l) + 0:(l-1)

}



cppFunction('NumericVector SeqInVec(NumericVector myVector, NumericVector mySequence) {



            int vecSize = myVector.size();

            int seqSize = mySequence.size();

            NumericVector comparison(seqSize);

            NumericVector res(vecSize);

            int foundCounter = 0;



            for (int i = 0; i + seqSize <= vecSize; i++ ) {  // only start where a full window fits in myVector



            for (int j = 0; j < seqSize; j++ ) {

            comparison[j] = mySequence[j] == myVector[i + j];

            }



            if (sum(comparison) == seqSize) {

            for (int j = 0; j < seqSize; j++ ) {

            res[foundCounter] = i + j + 1;

            foundCounter++;

            }

            }

            }



            if (foundCounter == 0) return NumericVector(0);  // no match: return an empty vector

            IntegerVector idx = seq(0, (foundCounter-1));

            return res[idx];

            }')



symb <- function(v,x) {SeqInVec(v, x)}



cppFunction('NumericVector SeqInVecOpt(NumericVector myVector, NumericVector mySequence) {



  int vecSize = myVector.size();

  int seqSize = mySequence.size();

  NumericVector comparison(seqSize);

  NumericVector res(vecSize);

  int foundCounter = 0;



  for (int i = 0; i + seqSize <= vecSize; i++ ) {  // only start where a full window fits in myVector



        if (myVector[i] == mySequence[0]) {

        for (int j = 0; j < seqSize; j++ ) {

          comparison[j] = mySequence[j] == myVector[i + j];

        }



        if (sum(comparison) == seqSize) {

          for (int j = 0; j < seqSize; j++ ) {

            res[foundCounter] = i + j + 1;

            foundCounter++;

          }

        }

        }

  }



  if (foundCounter == 0) return NumericVector(0);  // no match: return an empty vector

  IntegerVector idx = seq(0, (foundCounter-1));

  return res[idx];

}')



symbOpt <- function(v,x) {SeqInVecOpt(v,x)}

Since this is a cw-answer I'll add my own benchmark of some of the answers.

library(data.table)

library(microbenchmark)



set.seed(2); v <- sample(1:100, 5e7, TRUE); x <- c(2,3,5)



jaap1 <- function(v, x) { 

  which(rowSums(mapply('==',shift(v, type = 'lead', n = 0:(length(x) - 1)),

                       x)) == length(x)) 

}



jaap2 <- function(v, x) {

  which(Reduce("+", Map('==',shift(v, type = 'lead', n = 0:(length(x) - 1)),

                        x)) == length(x))

}



dd1 <- function(v, x) {

  idx <- which(v == x[1])

  idx[sapply(idx, function(i) all(v[i:(i+(length(x)-1))] == x))]

}



dd2 <- function(v, x) {

  idx <- which(v == x[1L])

  xl <- length(x) - 1L

  idx[sapply(idx, function(i) all(v[i:(i+xl)] == x))]

}



frank <- function(v, x) {

  w = seq_along(v)

  for (i in seq_along(x)) w = w[v[w+i-1L] == x[i]]

  w 

}



all.equal(jaap1(v, x), dd1(v, x))

all.equal(jaap2(v, x), dd1(v, x))

all.equal(dd2(v, x), dd1(v, x))

all.equal(frank(v, x), dd1(v, x))



bm <- microbenchmark(jaap1(v, x), jaap2(v, x), dd1(v, x), dd2(v, x), frank(v, x), 

                     unit = "relative", times = 25)



plot(bm)

[Figure: plot of the benchmark timings produced by plot(bm); image not included]

bm

Unit: relative

        expr      min       lq     mean   median       uq       max neval

 jaap1(v, x) 4.487360 4.591961 4.724153 4.870226 4.660023 3.9361093    25

 jaap2(v, x) 2.026052 2.159902 2.116204 2.282644 2.138106 2.1133068    25

   dd1(v, x) 1.078059 1.151530 1.119067 1.257337 1.201762 0.8646835    25

   dd2(v, x) 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000    25

 frank(v, x) 1.400735 1.376405 1.442887 1.427433 1.611672 1.3440097    25

Bottom line: without knowing the real data, all these benchmarks don't tell the whole story.

@docendodiscimus - could you update with the data you've used in your benchmarks? — Feb 11 '18 at 21:33
@SymbolixAU, yes of course. Sorry, I thought I had done that already. — Feb 12 '18 at 7:52
My answer in base R is (on average) 4x faster than jogo's answer, which uses a library. I have got +2/-2 votes and his answer +15. Hmmm :-/ — Feb 12 '18 at 9:47
@989 - I wouldn't take it personally; after the initial flurry of activity & votes, people don't often re-visit questions, which also means down-votes often won't get removed even if you improve the answer. — Feb 13 '18 at 3:46

Matt Summersgill · Accepted Answer · 2018-02-09 22:23:53Z

Here's a solution that leverages binary search on secondary indices in data.table. (The data.table vignette on secondary indices and auto indexing is a great introduction.)

This method has quite a bit of overhead, so it's not particularly competitive on the 1e4 length vector in the benchmark, but it hangs near the top of the pack as the size increases.

Hats off to everyone else posting solutions; I'm learning a lot from this question.

matt <- function(v,x){

  l <- length(x);

  SL <- seq_len(l-1);

  DT <- data.table(Seq_0 = v);

  for (i in SL) set(DT, j = eval(paste0("Seq_",i)), value = shift(DT[["Seq_0"]],n = i, type = "lead"));

  w <- DT[as.list(x),on = paste0("Seq_",c(0L,SL)), which = TRUE];

  rep(w, each = l) + 0:(l-1)

}

Benchmarking

library(data.table)

library(microbenchmark)

library(Rcpp)

library(zoo)



set.seed(2)

vl <- sample(1:10, 1e6, TRUE)

vm <- vl[1:1e5]

vs <- vl[1:1e4]

x <- c(2,3,5)

Vector Length 1e4

Unit: microseconds

           expr       min        lq       mean     median        uq       max neval

    symb(vs, x)   138.342   143.048   161.6681   153.1545   159.269   259.999    10

   frank(vs, x)   176.634   184.129   198.8060   193.2850   200.701   257.050    10

   jaap2(vs, x)   282.231   299.025   342.5323   316.5185   337.760   524.212    10

   jaap1(vs, x)   490.013   528.123   568.6168   538.7595   547.268   731.340    10

    a5c1(vs, x)   706.450   742.270   751.3092   756.2075   758.859   793.446    10

     dd2(vs, x)  1319.098  1348.082  2061.5579  1363.2265  1497.960  7913.383    10

 docendo(vs, x)  1427.768  1459.484  1536.6439  1546.2135  1595.858  1696.070    10

     dd1(vs, x)  1377.502  1406.272  2217.2382  1552.5030  1706.131  8084.474    10

    matt(vs, x)  1928.418  2041.597  2390.6227  2087.6335  2430.470  4762.909    10

    u989(vs, x)  8720.330  8821.987  8935.7188  8882.0190  9106.705  9163.967    10

   jogo1(vs, x) 47123.615 47536.700 49158.2600 48449.2390 50957.035 52496.981    10

Vector Length 1e5

Unit: milliseconds

           expr        min         lq       mean     median         uq        max neval

    symb(vm, x)   1.319921   1.378801   1.464972   1.423782   1.577006   1.682156    10

   frank(vm, x)   1.671155   1.739507   1.806548   1.760738   1.844893   2.097404    10

   jaap2(vm, x)   2.298449   2.380281   2.683813   2.432373   2.566581   4.310258    10

    matt(vm, x)   3.195048   3.495247   3.577080   3.607060   3.687222   3.844508    10

   jaap1(vm, x)   4.079117   4.179975   4.776989   4.496603   5.206452   6.295954    10

    a5c1(vm, x)   6.488621   6.617709   7.366226   6.720107   6.877529  12.500510    10

     dd2(vm, x)  12.595699  12.812876  14.990739  14.058098  16.758380  20.743506    10

 docendo(vm, x)  13.635357  13.999721  15.296075  14.729947  16.151790  18.541582    10

     dd1(vm, x)  13.474589  14.177410  15.676348  15.446635  17.150199  19.085379    10

    u989(vm, x)  94.844298  95.026733  96.309658  95.134400  97.460869 100.536654    10

   jogo1(vm, x) 575.230741 581.654544 621.824297 616.474265 628.267155 723.010738    10

Vector Length 1e6

Unit: milliseconds

           expr        min         lq       mean     median         uq        max neval

    symb(vl, x)   13.34294   13.55564   14.01556   13.61847   14.78210   15.26076    10

   frank(vl, x)   17.35628   17.45602   18.62781   17.56914   17.88896   25.38812    10

    matt(vl, x)   20.79867   21.07157   22.41467   21.23878   22.56063   27.12909    10

   jaap2(vl, x)   22.81464   22.92414   22.96956   22.99085   23.02558   23.10124    10

   jaap1(vl, x)   40.00971   40.46594   43.01407   41.03370   42.81724   55.90530    10

    a5c1(vl, x)   65.39460   65.97406   69.27288   66.28000   66.72847   83.77490    10

     dd2(vl, x)  127.47617  132.99154  161.85129  134.63168  157.40028  342.37526    10

     dd1(vl, x)  140.06140  145.45085  154.88780  154.23280  161.90710  171.60294    10

 docendo(vl, x)  147.07644  151.58861  162.20522  162.49216  165.49513  183.64135    10

    u989(vl, x) 2022.64476 2041.55442 2055.86929 2054.92627 2066.26187 2088.71411    10

   jogo1(vl, x) 5563.31171 5632.17506 5863.56265 5872.61793 6016.62838 6244.63205    10

score 2 · Accepted Answer · 2018-02-08 19:55:00Z

Here is a string-based approach in base R:

str <- paste(v, collapse = '-')

# "2-2-3-5-8-0-32-1-3-12-5-2-3-5-8-33-1"



pattern <- paste0('\\b', paste(x, collapse = '-'), '\\b')

# "\\b2-3-5-8\\b"



inds <- unlist(gregexpr(pattern, str)) # (1)

# 3 25

sapply(inds, function(i) lengths(strsplit(substr(str, 1, i),'-'))) # (2)



# [1]  2 12

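\\b (a regex word boundary, written with a doubled backslash inside an R string) is used for exact matching, so that for example 2-3-5-8 is not matched inside 12-3-5-8.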

(1) Finds the positions at which pattern is seen within str.

(2) Getting back the respective indices within the original vector v.
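
To see why the word boundary matters, here is a small sketch with a made-up vector in which the digits 2-3-5-8 also occur inside longer numbers. Note that in R source the boundary has to be typed as "\\b"; a single "\b" is a backspace character. I use perl = TRUE here only because I am sure of how \b behaves in that engine; the approach also assumes non-negative integers, since "-" is the separator.

```r
v2 <- c(12, 3, 5, 8, 2, 3, 5, 83, 2, 3, 5, 8)
x2 <- c(2, 3, 5, 8)

s     <- paste(v2, collapse = "-")
plain <- paste(x2, collapse = "-")
s
# [1] "12-3-5-8-2-3-5-83-2-3-5-8"

# without boundaries: also hits "12-3-5-8" and "2-3-5-83"
hits_plain <- as.integer(unlist(gregexpr(plain, s, fixed = TRUE)))
hits_plain
# [1]  2 10 19

# with word boundaries: only the genuine occurrence remains
bounded      <- paste0("\\b", plain, "\\b")
hits_bounded <- as.integer(unlist(gregexpr(bounded, s, perl = TRUE)))
hits_bounded
# [1] 19

# map the character position back to an index in v2
idx <- lengths(strsplit(substr(s, 1, hits_bounded), "-"))
idx
# [1] 9
```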

UPDATE

Following the discussion of running time, here is a much faster variant of my first solution:

str <- paste(v, collapse = '-')

pattern <- paste0('\\b', paste(x, collapse = '-'), '\\b')



inds <- c(1, unlist(gregexpr(pattern, str)))



m <- substring(str, head(inds,-1), tail(inds,-1))

ln <- lengths(strsplit(m, '-'))

cumsum(c(ln[1], ln[-1]-1))

edited Feb 8 '18 at 19:55

answered Feb 7 '18 at 15:51

989


2

I've updated the benchmarks and only included your fastest solution.

– Jaap
Feb 8 '18 at 19:35

I looked at what they return and then adjusted the solutions such that all would give the same result (didn't programmatically check it though)

– Jaap
Feb 8 '18 at 19:55

included now :-)

– Jaap
Feb 8 '18 at 20:11

thx for notifying, changed the construction of the vectors a bit; now it should return a normal vector :-)

– Jaap
Feb 8 '18 at 20:26

Please leave a note under the respective answers so they can improve. Could you check my benchmarking code? It could well be that I made a mistake somewhere.

– Jaap
Feb 8 '18 at 20:39

score 1 · Accepted Answer · 2018-02-12 15:36:45Z

EDIT: some have noted that my answer doesn't always give the desired output. I might fix it later; use with caution in the meantime!

We can convert v to a factor and keep only consecutive values in the transformed vector:

v2 <- as.numeric(factor(c(v,NA),levels = x)) # [1]  1  1  2  3  4 NA NA NA ...

v2[is.na(v2)] <- length(x)+1                 # [1]  1  1  2  3  4  5  5  5 ...

output <- diff(v2) ==1

# [1] FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
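
A small counterexample (with a made-up vector) shows the kind of case the caution above is about: the code flags any pair of neighbouring values that are consecutive in x, even when the full sequence never occurs.

```r
x3 <- c(2, 3, 5, 8)
v3 <- c(2, 3, 9)        # contains 2, 3 but not the full sequence 2, 3, 5, 8

v2 <- as.numeric(factor(c(v3, NA), levels = x3))   # 1 2 NA NA
v2[is.na(v2)] <- length(x3) + 1                    # 1 2 5 5
out <- diff(v2) == 1
out
# [1]  TRUE FALSE FALSE
```

Position 1 is reported as part of a match although c(2,3,5,8) does not appear in v3 at all.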

data

v <- c(2,2,3,5,8,0,32,1,3,12,5,2,3,5,8,33,1)

x <- c(2,3,5,8)

edited Feb 12 '18 at 15:36

answered Feb 7 '18 at 13:03

Moody_Mudskipper


1

that's pretty computationally intensive.

– Carl Witthoft
Feb 7 '18 at 14:30

Is it? I don't know; it's the only fully vectorized solution so far. Too many copies?

– Moody_Mudskipper
Feb 7 '18 at 14:33

I plead guilty to not having run microbenchmark on the various answers here. It's just a gut feeling because of the number of class coercions going on there.

– Carl Witthoft
Feb 7 '18 at 14:40

@CarlWitthoft, I guess that the answers by catastrophic-failure, which both utilise nested loops, will be much slower. But I too haven't tested any.

– docendo discimus
Feb 7 '18 at 14:51

1

@docendodiscimus see my latest benchmarks

– Carl Witthoft
Feb 7 '18 at 15:32

add a comment |

score 21 · Accepted Answer · 2018-02-07 10:08:59Z

21

Using base R you could do the following:

v <- c(2,2,3,5,8,0,32,1,3,12,5,2,3,5,8,33,1)

x <- c(2,3,5,8)



idx <- which(v == x[1])

idx[sapply(idx, function(i) all(v[i:(i+(length(x)-1))] == x))]

# [1]  2 12

This tells you that the exact sequence appears twice, starting at positions 2 and 12 of your vector v.

It first checks the possible starting positions, i.e. where v equals the first value of x and then loops through these positions to check if the values after these positions also equal the other values of x.

edited Feb 7 '18 at 10:08

answered Feb 7 '18 at 10:05

docendo discimus

51.7k1178116

I was going to suggest something like which(colSums(t(embed(v, length(x))[, length(x):1]) == x) == length(x)), but I think this is easy to follow....

– A5C1D2H2I1M1N2O1R2T1
Feb 7 '18 at 10:06

@A5C1D2H2I1M1N2O1R2T1, that looks indeed a little hard to follow

– docendo discimus
Feb 7 '18 at 10:10

2

Worth noting, idx <- which(v == x[1]) is an important step. While other answers are going through all 1:4 shift variations 14 times, this answer does it in 3 steps.

– zx8754
Feb 7 '18 at 10:51

@zx8754, ...but the "data.table" approach still manages to win in terms of speed in a couple of tests I did with larger vecs....

– A5C1D2H2I1M1N2O1R2T1
Feb 7 '18 at 10:52

add a comment |

score 16 · Accepted Answer · 2018-02-09 15:02:38Z

16

Two other approaches using the shift-function trom data.table:

library(data.table)



# option 1

which(rowSums(mapply('==',

                     shift(v, type = 'lead', n = 0:(length(x) - 1)),

                     x)

              ) == length(x))



# option 2

which(Reduce("+", Map('==',

                      shift(v, type = 'lead', n = 0:(length(x) - 1)),

                      x)

             ) == length(x))

both give:

[1]  2 12

To get a full vector of the matching positions:

l <- length(x)

w <- which(Reduce("+", Map('==',

                           shift(v, type = 'lead', n = 0:(l - 1)),

                           x)

                  ) == l)

rep(w, each = l) + 0:(l-1)

which gives:

[1]  2  3  4  5 12 13 14 15

The benchmark which was included earlier in this answer has been moved to a separate community wiki answer.

Used data:

v <- c(2,2,3,5,8,0,32,1,3,12,5,2,3,5,8,33,1)

x <- c(2,3,5,8)

edited Feb 9 '18 at 15:02

answered Feb 7 '18 at 10:24

Jaap

56.2k20119132

1

Many of these solutions don't give the desired output, the extra step is not cost free

– Moody_Mudskipper
Feb 8 '18 at 11:29

4

@989 I will update, but didn't have the time yet. What I do not understand is that you downvote me but don't downvote the invalid answers. What's the reason for that? Furthermore: why didn't you comment under the invalid answers so that they get a chance to improve?

– Jaap
Feb 9 '18 at 10:46

2

@989 You could always suggest Edit to this post or provide your own benchmark on your own post with explanation why Jaap's is wrong. No need for this kind of tone.

– zx8754
Feb 9 '18 at 10:57

1

Not sure whose idea it was, but getting a full vector of matching positions concatenated together seems like a bad idea. If I test with x = c(1,1,1) then I may find positions appearing multiple times. Besides, it's redundant -- the informational content of the first position is enough... Anyway, not a big deal, just my two cents ... noticed it all over the benchmarks.

– Frank
Feb 9 '18 at 22:11

1

@Frank I don't think it is a bad idea necessarily. It depends on what you want to do with it. I included it in the benchmarks to make sure every solution returns the same and thus get a fair comparison.

– Jaap
Feb 10 '18 at 7:04


score 15 · Accepted Answer · 2018-02-10 09:08:52Z


You can use rollapply() from zoo

v <- c(2,2,3,5,8,0,32,1,3,12,5,2,3,5,8,33,1)

x <- c(2,3,5,8)



library("zoo")

searchX <- function(x, X) all(x==X)

rollapply(v, FUN=searchX, X=x, width=length(x))

Each TRUE in the result marks the beginning of the sequence.
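The same sliding-window comparison can also be sketched in base R with embed(), which a comment under another answer in this thread also uses. This is an illustrative alternative, not the zoo implementation; win and starts are names chosen here for the example.

```r
v <- c(2,2,3,5,8,0,32,1,3,12,5,2,3,5,8,33,1)
x <- c(2,3,5,8)
l <- length(x)

# embed() returns each window in reverse order, so flip the columns;
# row i then holds v[i], v[i + 1], ..., v[i + l - 1]
win <- embed(v, l)[, l:1, drop = FALSE]

# keep the rows (i.e. start positions) where the whole window equals x
starts <- which(apply(win, 1, function(r) all(r == x)))
starts
# [1]  2 12
```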

The code could be simplified to rollapply(v, length(x), identical, x) (thanks to G. Grothendieck):

set.seed(2)

vl <- as.numeric(sample(1:10, 1e6, TRUE))

# vm <- vl[1:1e5]

# vs <- vl[1:1e4]

x <- c(2,3,5)



library("zoo")

searchX <- function(x, X) all(x==X)

i1 <- rollapply(vl, FUN=searchX, X=x, width=length(x))

i2 <- rollapply(vl, width=length(x), identical, y=x)



identical(i1, i2)

To use identical(), both arguments must be of the same type (num and int are not the same).

If needed, == coerces int to num; identical() does not do any coercion.
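A small illustration of that difference, using made-up vectors (v_int and x_dbl are names chosen for this example):

```r
identical(5L, 5)                     # FALSE: integer vs double
5L == 5                              # TRUE: == coerces before comparing

v_int <- c(2L, 3L, 5L)               # e.g. a window taken from sample(1:10, ...)
x_dbl <- c(2, 3, 5)                  # c(2, 3, 5) creates doubles

identical(v_int, x_dbl)              # FALSE, although the values agree
identical(as.numeric(v_int), x_dbl)  # TRUE once the types agree
all(v_int == x_dbl)                  # TRUE
```

This is why the benchmark vector above is wrapped in as.numeric() before identical() is used.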

edited Feb 10 '18 at 9:08

answered Feb 7 '18 at 10:03

jogo

10k92135

Could you check your 2nd solution? As you can see in the benchmark answer it doesn't return the same output as the other answers.

– Jaap
Feb 9 '18 at 15:04

1

I tried (also unsuccesfully) to repair it as well. I will remove it from the benchmarks.

– Jaap
Feb 9 '18 at 16:30

1

The code could be simplified to rollapply(v, length(x), identical, x) where v and x must be of the same type, e.g. both integer or both double, since for example identical(5L, 5) is FALSE.

– G. Grothendieck
Feb 10 '18 at 3:18

1

@G.Grothendieck Thx, that was indeed the issue. When both are of the same type, the solution with identical works.

– Jaap
Feb 10 '18 at 7:16


score 10 · Accepted Answer · 2018-02-07 21:38:19Z

I feel like looping should be efficient:

w = seq_along(v)

for (i in seq_along(x)) w = w[v[w+i-1L] == x[i]]



w 

# [1]  2 12
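One edge case worth keeping in mind: if a partial match runs off the end of v, v[w+i-1L] is NA and the logical subsetting leaves an NA in w. A possible guard, sketched here as a hypothetical variant (loop_safe is not from the original answer), is to wrap the comparison in which():

```r
# hypothetical NA-safe variant of the loop above
loop_safe <- function(v, x) {
  w <- seq_along(v)
  for (i in seq_along(x)) {
    # which() drops the NA produced when w + i - 1 points past the end of v
    w <- w[which(v[w + i - 1L] == x[i])]
  }
  w
}

loop_safe(c(2,2,3,5,8,0,32,1,3,12,5,2,3,5,8,33,1), c(2,3,5,8))
# [1]  2 12

# v ends in the middle of a potential match: no NA is returned
loop_safe(c(1, 2), c(2, 3))
# integer(0)
```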

This should be writable in C++ following @SymbolixAU's approach for extra speed.

A basic comparison:

# create functions for selected approaches

redjaap <- function(v,x)

  which(Reduce("+", Map('==', shift(v, type = 'lead', n = 0:(length(x) - 1)), x)) == length(x))

loop <- function(v,x){

  w = seq_along(v)

  for (i in seq_along(x)) w = w[v[w+i-1L] == x[i]]

  w

}



# check consistency

identical(redjaap(v,x), loop(v,x))

# [1] TRUE



# check speed

library(microbenchmark)

vv <- rep(v, 1e4)

microbenchmark(redjaap(vv,x), loop(vv,x), times = 100)

# Unit: milliseconds

#            expr      min       lq      mean   median       uq       max neval cld

#  redjaap(vv, x) 5.883809 8.058230 17.225899 9.080246 9.907514  96.35226   100   b

#     loop(vv, x) 3.629213 5.080816  9.475016 5.578508 6.495105 112.61242   100  a 



# check consistency again

identical(redjaap(vv,x), loop(vv,x))

# [1] TRUE

this method is really efficient in terms of the amount of code to achieve the objective...can use compiler::cmpfun(frank) for a slight speedup — Feb 22 '18 at 8:03

score 10 · Accepted Answer · 2018-07-16 22:23:19Z

Here are two Rcpp solutions. The first one returns a 0/1 vector of the same length as v, in which 1 marks each starting position of the sequence.

library(Rcpp)



v <- c(2,2,3,5,8,0,32,1,3,12,5,2,3,5,8,33,1)

x <- c(2,3,5,8)



cppFunction('NumericVector SeqInVec(NumericVector myVector, NumericVector mySequence) {



    int vecSize = myVector.size();

    int seqSize = mySequence.size();

    NumericVector comparison(seqSize);

    NumericVector res(vecSize);



    // stop early so that myVector[i + j] never reads past the end
    for (int i = 0; i + seqSize <= vecSize; i++ ) {



        for (int j = 0; j < seqSize; j++ ) {

                comparison[j] = mySequence[j] == myVector[i + j];

        }



        if (sum(comparison) == seqSize) {

            res[i] = 1;

        }else{

            res[i] = 0;

        }

    }



    return res;



    }')



SeqInVec(v, x)

#[1] 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0

This second one returns the index values (as per the other answers) of every matched entry in the sequence.

cppFunction('NumericVector SeqInVec(NumericVector myVector, NumericVector mySequence) {



  int vecSize = myVector.size();

  int seqSize = mySequence.size();

  NumericVector comparison(seqSize);

  NumericVector res(vecSize);

  int foundCounter = 0;



  // stop early so that myVector[i + j] never reads past the end
  for (int i = 0; i + seqSize <= vecSize; i++ ) {



    for (int j = 0; j < seqSize; j++ ) {

      comparison[j] = mySequence[j] == myVector[i + j];

    }



    if (sum(comparison) == seqSize) {

      for (int j = 0; j < seqSize; j++ ) {

        res[foundCounter] = i + j + 1;

        foundCounter++;

      }

    }

  }



  IntegerVector idx = seq(0, (foundCounter-1));

  return res[idx];

}')



SeqInVec(v, x)

# [1]  2  3  4  5 12 13 14 15

Optimising

As @MichaelChirico points out in their comment, further optimisations can be made. For example, if the first entry of the sequence doesn't match the current element of the vector, we can skip the rest of the comparison at that position.

cppFunction('NumericVector SeqInVecOpt(NumericVector myVector, NumericVector mySequence) {



  int vecSize = myVector.size();

  int seqSize = mySequence.size();

  NumericVector comparison(seqSize);

  NumericVector res(vecSize);

  int foundCounter = 0;



  // stop early so that myVector[i + j] never reads past the end
  for (int i = 0; i + seqSize <= vecSize; i++ ) {



    if (myVector[i] == mySequence[0]) {

        for (int j = 0; j < seqSize; j++ ) {

          comparison[j] = mySequence[j] == myVector[i + j];

        }



        if (sum(comparison) == seqSize) {

          for (int j = 0; j < seqSize; j++ ) {

            res[foundCounter] = i + j + 1;

            foundCounter++;

          }

        }

    }

  }



  IntegerVector idx = seq(0, (foundCounter-1));

  return res[idx];

}')

The answer with benchmarks shows the performance of these approaches.

Could you update your solution such that it returns the same output as the others? I can then include it in the benchmark. — Feb 8 '18 at 20:14
since you're examining subsequent elements, shouldn't there be a way to optimize by skipping elements we already know don't start the sequence? e.g. in OPs example when checking at the second 2 we already know the 3rd element is not 2 so we can skip checking the elements after 3 — Feb 11 '18 at 11:18
2-3x speed-up, nice! I guess the improvement depends on the length of the "search" string and its density (% of TRUE values). — Feb 12 '18 at 0:09
@MichaelChirico - yes that will likely be a factor. I've also tested a variation where it will increment i by the size of the search string, rather than one each time. In this example I didn't see any improvement, however. — Feb 12 '18 at 0:13

score 8 · Accepted Answer · 2018-07-16 22:22:42Z

A benchmark on the posted answers:

Load the needed packages:

library(data.table)

library(microbenchmark)

library(Rcpp)

library(zoo)

Creating the vectors on which the benchmarks will be run:

set.seed(2)

vl <- sample(1:10, 1e6, TRUE)

vm <- vl[1:1e5]

vs <- vl[1:1e4]

x <- c(2,3,5)

Testing whether all solutions give the same outcome on the small vector vs:

> all.equal(jaap1(vs,x), jaap2(vs,x))

[1] TRUE

> all.equal(jaap1(vs,x), docendo(vs,x))

[1] TRUE

> all.equal(jaap1(vs,x), a5c1(vs,x))

[1] TRUE

> all.equal(jaap1(vs,x), jogo1(vs,x))

[1] TRUE

> all.equal(jaap1(vs,x), moody(vs,x))

[1] "Numeric: lengths (24, 873) differ"

> all.equal(jaap1(vs,x), cata1(vs,x))

[1] "Numeric: lengths (24, 0) differ"

> all.equal(jaap1(vs,x), u989(vs,x))

[1] TRUE

> all.equal(jaap1(vs,x), frank(vs,x))

[1] TRUE

> all.equal(jaap1(vs,x), symb(vs,x))

[1] TRUE

> all.equal(jaap1(vs, x), symbOpt(vs, x))

[1] TRUE

Further inspection of the cata1 and moody solutions shows that they don't give the desired output. They are therefore not included in the benchmarks.
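Note that all.equal() returns a character description rather than FALSE when its arguments differ, as in the two mismatches above. When such a check is scripted rather than read by eye, wrapping it in isTRUE() is the usual idiom. A small sketch with made-up vectors (a and b are illustrative only):

```r
a <- c(2, 12)
b <- c(2, 12, 30)

res <- all.equal(a, b)
is.character(res)        # TRUE: a message describing the difference, not FALSE
isTRUE(res)              # FALSE, so this form is safe inside if () or stopifnot()

isTRUE(all.equal(a, a))  # TRUE
```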

The benchmark for the smallest vector vs:

mbs <- microbenchmark(jaap1(vs,x), jaap2(vs,x), docendo(vs,x), a5c1(vs,x),

                      jogo1(vs,x), u989(vs,x), frank(vs,x), symb(vs,x), symbOpt(vs, x),

                      times = 100)

gives:

 print(mbs, order = "median")



 Unit: microseconds

          expr       min         lq        mean     median         uq        max neval

  symbOpt(vs, x)    40.658    47.0565    78.47119    51.5220    56.2765   2170.708   100

     symb(vs, x)   106.208   112.7885   151.76398   117.0655   123.7450   1976.360   100

    frank(vs, x)   121.303   129.0515   203.13616   132.1115   137.9370   6193.837   100

    jaap2(vs, x)   187.973   218.7805   322.98300   235.0535   255.2275   6287.548   100

    jaap1(vs, x)   306.944   341.4055   452.32426   358.2600   387.7105   6376.805   100

     a5c1(vs, x)   463.721   500.9465   628.13475   516.2845   553.2765   6179.304   100

  docendo(vs, x)  1139.689  1244.0555  1399.88150  1313.6295  1363.3480   9516.529   100

     u989(vs, x)  8048.969  8244.9570  8735.97523  8627.8335  8858.7075  18732.750   100

    jogo1(vs, x) 40022.406 42208.4870 44927.58872 43733.8935 45008.0360 124496.190   100

The benchmark for the medium vector vm:

mbm <- microbenchmark(jaap1(vm,x), jaap2(vm,x), docendo(vm,x), a5c1(vm,x),

                      jogo1(vm,x), u989(vm,x), frank(vm,x), symb(vm,x), symbOpt(vm, x),

                      times = 100)

gives:

print(mbm, order = "median")



Unit: microseconds

           expr        min          lq        mean      median         uq        max neval

 symbOpt(vm, x)    357.452    405.0415    974.9058    763.0205   1067.803   7444.126   100

    symb(vm, x)   1032.915   1117.7585   1923.4040   1422.1930   1753.044  17498.132   100

   frank(vm, x)   1158.744   1470.8170   1829.8024   1826.1330   1935.641   6423.966   100

   jaap2(vm, x)   1622.183   2872.7725   3798.6536   3147.7895   3680.954  14886.765   100

   jaap1(vm, x)   3053.024   4729.6115   7325.3753   5607.8395   6682.814  87151.774   100

    a5c1(vm, x)   5487.547   7458.2025   9612.5545   8137.1255   9420.684  88798.914   100

 docendo(vm, x)  10780.920  11357.7440  13313.6269  12029.1720  13411.026  21984.294   100

    u989(vm, x)  83518.898  84999.6890  88537.9931  87675.3260  90636.674 105681.313   100

   jogo1(vm, x) 471753.735 512979.3840 537232.7003 534780.8050 556866.124 646810.092   100

The benchmark for the largest vector vl:

mbl <- microbenchmark(jaap1(vl,x), jaap2(vl,x), docendo(vl,x), a5c1(vl,x),

                      jogo1(vl,x), u989(vl,x), frank(vl,x), symb(vl,x), symbOpt(vl, x),

                      times = 100)

gives:

  print(mbl, order = "median")



Unit: milliseconds

           expr         min          lq       mean     median         uq       max neval

 symbOpt(vl, x)    4.679646    5.768531   12.30079    6.67608   11.67082  118.3467   100

    symb(vl, x)   11.356392   12.656124   21.27423   13.74856   18.66955  149.9840   100

   frank(vl, x)   13.523963   14.929656   22.70959   17.53589   22.04182  132.6248   100

   jaap2(vl, x)   18.754847   24.968511   37.89915   29.78309   36.47700  145.3471   100

   jaap1(vl, x)   37.047549   52.500684   95.28392   72.89496  138.55008  234.8694   100

    a5c1(vl, x)   54.563389   76.704769  116.89269   89.53974  167.19679  248.9265   100

 docendo(vl, x)  109.824281  124.631557  156.60513  129.64958  145.47547  296.0214   100

    u989(vl, x) 1380.886338 1413.878029 1454.50502 1436.18430 1479.18934 1632.3281   100

   jogo1(vl, x) 4067.106897 4339.005951 4472.46318 4454.89297 4563.08310 5114.4626   100

The used functions of each solution:

jaap1 <- function(v,x) {

  l <- length(x);

  w <- which(rowSums(mapply('==', shift(v, type = 'lead', n = 0:(length(x) - 1)), x) ) == length(x));

  rep(w, each = l) + 0:(l-1)

}



jaap2 <- function(v,x) {

  l <- length(x);

  w <- which(Reduce("+", Map('==', shift(v, type = 'lead', n = 0:(length(x) - 1)), x)) == length(x));

  rep(w, each = l) + 0:(l-1)

}



docendo <- function(v,x) {

  l <- length(x);

  idx <- which(v == x[1]);

  w <- idx[sapply(idx, function(i) all(v[i:(i+(length(x)-1))] == x))];

  rep(w, each = l) + 0:(l-1)

}



a5c1 <- function(v,x) {

  l <- length(x);

  w <- which(colSums(t(embed(v, l)[, l:1]) == x) == l);

  rep(w, each = l) + 0:(l-1)

}



jogo1 <- function(v,x) {

  l <- length(x);

  searchX <- function(x, X) all(x==X);

  w <- which(rollapply(v, FUN=searchX, X=x, width=l));

  rep(w, each = l) + 0:(l-1)

}



moody <- function(v,x) {

  l <- length(x);

  v2 <- as.numeric(factor(c(v,NA),levels = x));

  v2[is.na(v2)] <- l+1;

  which(diff(v2) == 1)

}



cata1 <- function(v,x) {

  l <- length(x);

  w <- which(sapply(lapply(seq(length(v)-l)-1, function(i) v[seq(x)+i]), identical, x));

  rep(w, each = l) + 0:(l-1)

}



u989 <- function(v,x) {

  l <- length(x);

  s <- paste(v, collapse = '-');

  p <- paste0('\\b', paste(x, collapse = '-'), '\\b');

  i <- c(1, unlist(gregexpr(p, s)));

  m <- substring(s, head(i,-1), tail(i,-1));

  ln <- lengths(strsplit(m, '-'));

  w <- cumsum(c(ln[1], ln[-1]-1));

  rep(w, each = l) + 0:(l-1)

}



frank <- function(v,x) {

  l <- length(x);

  w = seq_along(v);

  for (i in seq_along(x)) w = w[v[w+i-1L] == x[i]];

  rep(w, each = l) + 0:(l-1)

}



cppFunction('NumericVector SeqInVec(NumericVector myVector, NumericVector mySequence) {



            int vecSize = myVector.size();

            int seqSize = mySequence.size();

            NumericVector comparison(seqSize);

            NumericVector res(vecSize);

            int foundCounter = 0;



            // stop early so that myVector[i + j] never reads past the end
            for (int i = 0; i + seqSize <= vecSize; i++ ) {



            for (int j = 0; j < seqSize; j++ ) {

            comparison[j] = mySequence[j] == myVector[i + j];

            }



            if (sum(comparison) == seqSize) {

            for (int j = 0; j < seqSize; j++ ) {

            res[foundCounter] = i + j + 1;

            foundCounter++;

            }

            }

            }



            IntegerVector idx = seq(0, (foundCounter-1));

            return res[idx];

            }')



symb <- function(v,x) {SeqInVec(v, x)}



cppFunction('NumericVector SeqInVecOpt(NumericVector myVector, NumericVector mySequence) {



  int vecSize = myVector.size();

  int seqSize = mySequence.size();

  NumericVector comparison(seqSize);

  NumericVector res(vecSize);

  int foundCounter = 0;



  // stop early so that myVector[i + j] never reads past the end
  for (int i = 0; i + seqSize <= vecSize; i++ ) {



        if (myVector[i] == mySequence[0]) {

        for (int j = 0; j < seqSize; j++ ) {

          comparison[j] = mySequence[j] == myVector[i + j];

        }



        if (sum(comparison) == seqSize) {

          for (int j = 0; j < seqSize; j++ ) {

            res[foundCounter] = i + j + 1;

            foundCounter++;

          }

        }

        }

  }



  IntegerVector idx = seq(0, (foundCounter-1));

  return res[idx];

}')



symbOpt <- function(v,x) {SeqInVecOpt(v,x)}

Since this is a cw-answer I'll add my own benchmark of some of the answers.

library(data.table)

library(microbenchmark)



set.seed(2); v <- sample(1:100, 5e7, TRUE); x <- c(2,3,5)



jaap1 <- function(v, x) { 

  which(rowSums(mapply('==',shift(v, type = 'lead', n = 0:(length(x) - 1)),

                       x)) == length(x)) 

}



jaap2 <- function(v, x) {

  which(Reduce("+", Map('==',shift(v, type = 'lead', n = 0:(length(x) - 1)),

                        x)) == length(x))

}



dd1 <- function(v, x) {

  idx <- which(v == x[1])

  idx[sapply(idx, function(i) all(v[i:(i+(length(x)-1))] == x))]

}



dd2 <- function(v, x) {

  idx <- which(v == x[1L])

  xl <- length(x) - 1L

  idx[sapply(idx, function(i) all(v[i:(i+xl)] == x))]

}



frank <- function(v, x) {

  w = seq_along(v)

  for (i in seq_along(x)) w = w[v[w+i-1L] == x[i]]

  w 

}



all.equal(jaap1(v, x), dd1(v, x))

all.equal(jaap2(v, x), dd1(v, x))

all.equal(dd2(v, x), dd1(v, x))

all.equal(frank(v, x), dd1(v, x))



bm <- microbenchmark(jaap1(v, x), jaap2(v, x), dd1(v, x), dd2(v, x), frank(v, x), 

                     unit = "relative", times = 25)



plot(bm)


bm

Unit: relative

        expr      min       lq     mean   median       uq       max neval

 jaap1(v, x) 4.487360 4.591961 4.724153 4.870226 4.660023 3.9361093    25

 jaap2(v, x) 2.026052 2.159902 2.116204 2.282644 2.138106 2.1133068    25

   dd1(v, x) 1.078059 1.151530 1.119067 1.257337 1.201762 0.8646835    25

   dd2(v, x) 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000    25

 frank(v, x) 1.400735 1.376405 1.442887 1.427433 1.611672 1.3440097    25

Bottom line: without knowing the real data, all these benchmarks don't tell the whole story.
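One plausible reason the rankings differ between the two benchmarks is the data itself: approaches that first filter on v == x[1] only have to examine the surviving candidates, and the share of candidates depends on how many distinct values v contains. A rough sketch, assuming uniformly sampled integers as in the benchmark code above (v10 and v100 are names chosen here, and the vectors are shorter than the benchmark ones):

```r
set.seed(2)
v10  <- sample(1:10,  1e5, TRUE)   # values drawn as in the first benchmark
v100 <- sample(1:100, 1e5, TRUE)   # values drawn as in this benchmark

# share of positions that survive the first filter which(v == x[1])
frac10  <- mean(v10  == 2)   # roughly 0.10
frac100 <- mean(v100 == 2)   # roughly 0.01
c(frac10, frac100)
```

With about ten times fewer candidates, the per-candidate sapply() step has far less work to do, which may explain part of the difference.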

@docendodiscimus - could you update with the data you've used in your benchmarks? — Feb 11 '18 at 21:33
@SymbolixAU, yes of course. Sorry, I thought I had done that already. — Feb 12 '18 at 7:52
My answer in base R is (on average) 4x times faster than jogo's answer with the help of a library. I have got +2/-2 votes and his answer +15. Hmmm :-/ — Feb 12 '18 at 9:47
@989 - I wouldn't take it personally; after the initial flurry of activity & votes, people don't often re-visit questions, which also means down-votes often won't get removed even if you improve the answer. — Feb 13 '18 at 3:46

Matt Summersgill · Accepted Answer · 2018-02-09 22:23:53Z

Here's a solution that leverages binary search on secondary indices in data.table. (Great vignette here)

This method has quite a bit of overhead so it's not particularly competitive on the 1e4 length vector in the benchmark, but it hangs near the top of the pack as the size increases.

Hats off to everyone else posting solutions, learning a lot from this question.

matt <- function(v,x){

  l <- length(x);

  SL <- seq_len(l-1);

  DT <- data.table(Seq_0 = v);

  for (i in SL) set(DT, j = eval(paste0("Seq_",i)), value = shift(DT[["Seq_0"]],n = i, type = "lead"));

  w <- DT[as.list(x),on = paste0("Seq_",c(0L,SL)), which = TRUE];

  rep(w, each = l) + 0:(l-1)

}

Benchmarking

library(data.table)

library(microbenchmark)

library(Rcpp)

library(zoo)



set.seed(2)

vl <- sample(1:10, 1e6, TRUE)

vm <- vl[1:1e5]

vs <- vl[1:1e4]

x <- c(2,3,5)

Vector Length 1e4

Unit: microseconds

           expr       min        lq       mean     median        uq       max neval

    symb(vs, x)   138.342   143.048   161.6681   153.1545   159.269   259.999    10

   frank(vs, x)   176.634   184.129   198.8060   193.2850   200.701   257.050    10

   jaap2(vs, x)   282.231   299.025   342.5323   316.5185   337.760   524.212    10

   jaap1(vs, x)   490.013   528.123   568.6168   538.7595   547.268   731.340    10

    a5c1(vs, x)   706.450   742.270   751.3092   756.2075   758.859   793.446    10

     dd2(vs, x)  1319.098  1348.082  2061.5579  1363.2265  1497.960  7913.383    10

 docendo(vs, x)  1427.768  1459.484  1536.6439  1546.2135  1595.858  1696.070    10

     dd1(vs, x)  1377.502  1406.272  2217.2382  1552.5030  1706.131  8084.474    10

    matt(vs, x)  1928.418  2041.597  2390.6227  2087.6335  2430.470  4762.909    10

    u989(vs, x)  8720.330  8821.987  8935.7188  8882.0190  9106.705  9163.967    10

   jogo1(vs, x) 47123.615 47536.700 49158.2600 48449.2390 50957.035 52496.981    10

Vector Length 1e5

Unit: milliseconds

           expr        min         lq       mean     median         uq        max neval

    symb(vm, x)   1.319921   1.378801   1.464972   1.423782   1.577006   1.682156    10

   frank(vm, x)   1.671155   1.739507   1.806548   1.760738   1.844893   2.097404    10

   jaap2(vm, x)   2.298449   2.380281   2.683813   2.432373   2.566581   4.310258    10

    matt(vm, x)   3.195048   3.495247   3.577080   3.607060   3.687222   3.844508    10

   jaap1(vm, x)   4.079117   4.179975   4.776989   4.496603   5.206452   6.295954    10

    a5c1(vm, x)   6.488621   6.617709   7.366226   6.720107   6.877529  12.500510    10

     dd2(vm, x)  12.595699  12.812876  14.990739  14.058098  16.758380  20.743506    10

 docendo(vm, x)  13.635357  13.999721  15.296075  14.729947  16.151790  18.541582    10

     dd1(vm, x)  13.474589  14.177410  15.676348  15.446635  17.150199  19.085379    10

    u989(vm, x)  94.844298  95.026733  96.309658  95.134400  97.460869 100.536654    10

   jogo1(vm, x) 575.230741 581.654544 621.824297 616.474265 628.267155 723.010738    10

Vector Length 1e6

Unit: milliseconds

           expr        min         lq       mean     median         uq        max neval

    symb(vl, x)   13.34294   13.55564   14.01556   13.61847   14.78210   15.26076    10

   frank(vl, x)   17.35628   17.45602   18.62781   17.56914   17.88896   25.38812    10

    matt(vl, x)   20.79867   21.07157   22.41467   21.23878   22.56063   27.12909    10

   jaap2(vl, x)   22.81464   22.92414   22.96956   22.99085   23.02558   23.10124    10

   jaap1(vl, x)   40.00971   40.46594   43.01407   41.03370   42.81724   55.90530    10

    a5c1(vl, x)   65.39460   65.97406   69.27288   66.28000   66.72847   83.77490    10

     dd2(vl, x)  127.47617  132.99154  161.85129  134.63168  157.40028  342.37526    10

     dd1(vl, x)  140.06140  145.45085  154.88780  154.23280  161.90710  171.60294    10

 docendo(vl, x)  147.07644  151.58861  162.20522  162.49216  165.49513  183.64135    10

    u989(vl, x) 2022.64476 2041.55442 2055.86929 2054.92627 2066.26187 2088.71411    10

   jogo1(vl, x) 5563.31171 5632.17506 5863.56265 5872.61793 6016.62838 6244.63205    10

score 2 · Accepted Answer · 2018-02-08 19:55:00Z


Here is a string-based approach in base R:

str <- paste(v, collapse = '-')

# "2-2-3-5-8-0-32-1-3-12-5-2-3-5-8-33-1"



pattern <- paste0('\\b', paste(x, collapse = '-'), '\\b')

# "\\b2-3-5-8\\b"



inds <- unlist(gregexpr(pattern, str)) # (1)

# 3 25

sapply(inds, function(i) lengths(strsplit(substr(str, 1, i),'-'))) # (2)



# [1]  2 12

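\\b (a regex word boundary, written with a doubled backslash inside an R string) is used for exact matching, so that the pattern cannot match in the middle of a longer number.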

(1) Finds the positions at which pattern is seen within str.

(2) Getting back the respective indices within the original vector v.
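To see why the word boundaries matter, here is a small sketch with a made-up vector in which the digits of the pattern occur only inside longer numbers (s and key are names chosen for this example):

```r
s   <- paste(c(12, 3, 5, 81), collapse = "-")   # "12-3-5-81"
key <- paste(c(2, 3, 5, 8), collapse = "-")     # "2-3-5-8"

# without boundaries the pattern matches inside 12 ... 81: a false positive
no_boundary <- grepl(key, s)
no_boundary
# [1] TRUE

# with \\b on both sides the match must start and end at a number boundary
with_boundary <- grepl(paste0("\\b", key, "\\b"), s)
with_boundary
# [1] FALSE
```

The approach still assumes that "-" is safe as a separator, which would not hold for negative numbers.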

UPDATE

As for the discussion of running-time efficiency, here is a much faster solution than my first solution:

str <- paste(v, collapse = '-')

pattern <- paste0('\\b', paste(x, collapse = '-'), '\\b')



inds <- c(1, unlist(gregexpr(pattern, str)))



m <- substring(str, head(inds,-1), tail(inds,-1))

ln <- lengths(strsplit(m, '-'))

cumsum(c(ln[1], ln[-1]-1))

edited Feb 8 '18 at 19:55

answered Feb 7 '18 at 15:51

989

8,98751834

2

I've updated the benchmarks and only included your fastest solution.

– Jaap
Feb 8 '18 at 19:35

I looked at what they return and then adjusted the solutions such that all would give the same result (didn't programmatically check it though)

– Jaap
Feb 8 '18 at 19:55

included now :-)

– Jaap
Feb 8 '18 at 20:11

thx for notifying, changed the construction of the vectors a bit; now it should return a normal vector :-)

– Jaap
Feb 8 '18 at 20:26

please leave a note under the respective answers so they can improve; could you check my benchmarking codes? it could as well that I made a mistake somewhere

– Jaap
Feb 8 '18 at 20:39


score 1 · Accepted Answer · 2018-02-12 15:36:45Z


EDIT: Some have noted that my answer doesn't always give the desired output. I might fix it later; use it with caution in the meantime!

We can convert v to factors and keep only consecutive values in our transformed vector:

v2 <- as.numeric(factor(c(v,NA),levels = x)) # [1]  1  1  2  3  4 NA NA NA ...

v2[is.na(v2)] <- length(x)+1                 # [1]  1  1  2  3  4  5  5  5 ...

output <- diff(v2) ==1

# [1] FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
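One way to see the limitation mentioned in the EDIT note is a vector that contains only the beginning of the sequence. A minimal sketch with a made-up input (v_partial is illustrative only):

```r
x <- c(2, 3, 5, 8)
v_partial <- c(2, 3, 9)   # starts like x, but the sequence is never completed

v2 <- as.numeric(factor(c(v_partial, NA), levels = x))   # 1 2 NA NA
v2[is.na(v2)] <- length(x) + 1                            # 1 2 5 5
flagged <- diff(v2) == 1
flagged
# [1]  TRUE FALSE FALSE
```

Position 1 is flagged even though the full sequence 2, 3, 5, 8 never occurs, because the check only looks at pairs of consecutive levels.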

data

v <- c(2,2,3,5,8,0,32,1,3,12,5,2,3,5,8,33,1)

x <- c(2,3,5,8)

edited Feb 12 '18 at 15:36

answered Feb 7 '18 at 13:03

Moody_Mudskipper

22.7k32964

1

that's pretty computationally intensive.

– Carl Witthoft
Feb 7 '18 at 14:30

is it ? I don't know, it's the only fully vectorized solution so far, too many copies ?

– Moody_Mudskipper
Feb 7 '18 at 14:33

I plead guilty to not having run microbenchmark on the various answers here. It's just a gut feeling because of the number of class coercions going on there.

– Carl Witthoft
Feb 7 '18 at 14:40

@CarlWitthoft, I guess that the answers by catastrophic-failure, which both utilise nested loops, will be much slower. But I too haven't tested any.

– docendo discimus
Feb 7 '18 at 14:51

1

@docendodiscimus see my latest benchmarks

– Carl Witthoft
Feb 7 '18 at 15:32


Get indexes of a vector of numbers in another vector
