How to search a field which could contain spaces, hyphens (-) and a concatenated number?
Hi, I have a field with the following schema:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.WordDelimiterFilterFactory" catenateNumbers="1" generateNumberParts="1" protected="protwords.txt" splitOnCaseChange="1" generateWordParts="0" preserveOriginal="1" catenateAll="0" catenateWords="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.WordDelimiterFilterFactory" catenateNumbers="0" protected="protwords.txt" splitOnCaseChange="1" generateWordParts="0" preserveOriginal="1" catenateAll="0" catenateWords="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
I am storing complete PDF documents.
Now suppose I have 4 documents with the following content:
1. stackoverflow is a good site.
2. stack-overflow is a good site.
3. stack overflow is a good site.
4. stackoverflow2018 is a good site.
When I search for stackoverflow, it should return document 1.
When I search for stack-overflow, it should return document 2.
When I search for stack overflow, it should return document 3.
When I search for stackoverflow2018, it should return document 4.
What should the schema be for this? The schema above does not work in this case.
Is there anything I could specify in the query?
solr lucene solr6
asked Nov 13 '18 at 12:15
Root
1 Answer
A Word Delimiter Graph Filter will split on non-alphanumerics (-), case changes, and numbers by default.
The rules for determining delimiters are determined as follows:
- A change in case within a word: "CamelCase" -> "Camel", "Case". This can be disabled by setting splitOnCaseChange="0".
- A transition from alpha to numeric characters or vice versa: "Gonzo5000" -> "Gonzo", "5000"; "4500XL" -> "4500", "XL". This can be disabled by setting splitOnNumerics="0".
- Non-alphanumeric characters (discarded): "hot-spot" -> "hot", "spot"
- A trailing "'s" is removed: "O’Reilly’s" -> "O", "Reilly"
- Any leading or trailing delimiters are discarded: "--hot-spot--" -> "hot", "spot"
If you don't want that behavior, remove the WordDelimiterFilter from your filter list and add other filters to support the part of the WDF behavior that you need.
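One way to read the suggestion above, as a minimal sketch: a field type with no word delimiter filter, so that each whitespace-separated token is indexed as-is, only lowercased. The name text_exact is hypothetical, and leaving out the stop word and synonym filters is a simplification for illustration, not something the answer prescribes.

```xml
<!-- Hypothetical field type: no WordDelimiterFilterFactory, so "stack-overflow"
     and "stackoverflow2018" each stay a single token. -->
<fieldType name="text_exact" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Split on whitespace only -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Case-insensitive matching -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With a chain like this, trailing punctuation stays attached to its token ("site." keeps its period), which is one of the WDF behaviors you would then need another filter for if you rely on it.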
answered Nov 13 '18 at 12:23
MatsLindh
Non-alphanumeric characters (discarded): "hot-spot" -> "hot", "spot". Is there any way to discard/disable this behavior? – Root, Nov 13 '18 at 15:10
Not everything at once, but you can use the types parameter with a file that redefines - to be an alphanumeric: - => ALPHANUM. See the types parameter in the source linked above. – MatsLindh, Nov 13 '18 at 15:48
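A sketch of what the types approach from the comment could look like. The file name wdftypes.txt is an assumption (it would live in the core's conf directory), and the remaining attributes are copied from the index-time filter in the question. The same types attribute would presumably be needed on the query-time filter as well, so that both sides treat the hyphen the same way.

```xml
<!-- Contents of the assumed file conf/wdftypes.txt (one mapping per line):
       - => ALPHANUM
     This reclassifies the hyphen so it is no longer treated as a delimiter. -->
<filter class="solr.WordDelimiterFilterFactory" types="wdftypes.txt" catenateNumbers="1" generateNumberParts="1" protected="protwords.txt" splitOnCaseChange="1" generateWordParts="0" preserveOriginal="1" catenateAll="0" catenateWords="1"/>
```

Documents would need to be reindexed after a change like this, since the tokens already in the index were produced by the old analysis chain.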