How to search a field that could contain spaces, hyphens, and a concatenated number?

Hi, I have a field with the following schema:

  <fieldType name="text" class="solr.TextField" positionIncrementGap="100">

    <analyzer type="index">

      <tokenizer class="solr.WhitespaceTokenizerFactory"/>

      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>

      <filter class="solr.WordDelimiterFilterFactory" catenateNumbers="1" generateNumberParts="1" protected="protwords.txt" splitOnCaseChange="1" generateWordParts="0" preserveOriginal="1" catenateAll="0" catenateWords="1"/>

      <filter class="solr.LowerCaseFilterFactory"/>

      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>

    </analyzer>

    <analyzer type="query">

      <tokenizer class="solr.WhitespaceTokenizerFactory"/>

      <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>

      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>

      <filter class="solr.WordDelimiterFilterFactory" catenateNumbers="0" protected="protwords.txt" splitOnCaseChange="1" generateWordParts="0" preserveOriginal="1" catenateAll="0" catenateWords="0"/>

      <filter class="solr.LowerCaseFilterFactory"/>

      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>

    </analyzer>

  </fieldType>

I am storing complete PDF documents.

Now suppose I have 4 documents with the following content:

1. stackoverflow is a good site.

2. stack-overflow is a good site.

3. stack overflow is a good site.

4. stackoverflow2018 is a good site.

When I search for stackoverflow, it should return only document 1.
When I search for stack-overflow, it should return only document 2.
When I search for stack overflow, it should return only document 3.
When I search for stackoverflow2018, it should return only document 4.

What should the schema be for this? The schema above does not work in this case.
Is there anything I could specify in the query?

asked Nov 13 '18 at 12:15

Root

solr lucene solr6

1 Answer

A Word Delimiter Graph Filter splits on non-alphanumeric characters (such as -), case changes, and transitions between letters and digits by default.

The rules for determining delimiters are determined as follows:

A change in case within a word: "CamelCase" -> "Camel", "Case". This
can be disabled by setting splitOnCaseChange="0".

A transition from alpha to numeric characters or vice versa:
"Gonzo5000" -> "Gonzo", "5000"; "4500XL" -> "4500", "XL". This can be
disabled by setting splitOnNumerics="0".

Non-alphanumeric characters (discarded): "hot-spot" -> "hot", "spot"

A trailing "'s" is removed: "O’Reilly’s" -> "O", "Reilly"

Any leading or trailing delimiters are discarded: "--hot-spot--" ->
"hot", "spot"

If you don't want that behavior, remove the WordDelimiterFilter from your filter list and add other filters to support the part of the WDF behavior that you need.
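
As a minimal sketch of that suggestion, assuming exact token matching is all that is needed, a field type without the word delimiter filter could look like the following. The name text_exact is hypothetical, and the stop word and synonym filters from the question are left out for brevity.

```xml
<!-- Sketch only: "text_exact" is a hypothetical name, not from the question. -->
<fieldType name="text_exact" class="solr.TextField" positionIncrementGap="100">
  <!-- One analyzer, used for both indexing and querying -->
  <analyzer>
    <!-- Split on whitespace only, so "stack-overflow" and "stackoverflow2018" stay whole -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Case-insensitive matching -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this chain, document 2 is indexed with the single token stack-overflow, so a query for stackoverflow should no longer match it. A query for stack overflow is analyzed as the two tokens stack and overflow, which appear only in document 3. The documents need to be reindexed after the analysis chain changes.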

answered Nov 13 '18 at 12:23

MatsLindh

Non-alphanumeric characters (discarded): "hot-spot" -> "hot", "spot". Is there any way to disable this behavior?

– Root
Nov 13 '18 at 15:10

Not everything at once, but you can use the types parameter with a file that redefines - as an alphanumeric character: "- => ALPHANUM". See the types parameter in the source linked above.

– MatsLindh
Nov 13 '18 at 15:48
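
The types approach from this comment might be sketched as follows. The file name wdfftypes.txt is an assumption, and splitOnNumerics="0" is added here (it is not in the question's schema) to keep stackoverflow2018 as one token. The other attributes are copied from the question's query analyzer.

```xml
<!-- Sketch only. Assumes a file named wdfftypes.txt in the same conf directory
     as the schema, containing the single mapping line:
       - => ALPHANUM
-->
<filter class="solr.WordDelimiterFilterFactory"
        types="wdfftypes.txt"
        splitOnNumerics="0"
        splitOnCaseChange="1"
        generateWordParts="0"
        preserveOriginal="1"
        catenateWords="0"
        catenateNumbers="0"
        catenateAll="0"
        protected="protwords.txt"/>
```

The same filter definition would likely need to be used in both the index and query analyzers, followed by a reindex, so that hyphenated terms are tokenized the same way on both sides.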
