Java - Full-text inverted index: defining a word
I am working on a simple full-text inverted index, building an index of words that I extract from PDF files. I am using the PDFBox library to do this.
However, I would like to know how to define what counts as a word to index. My indexing currently treats every run of characters separated by a space as a word token. For example,
This string, is a code.
In this case, the index table would contain:
This
string,
is
a
code.
The flaw is that a token such as "string," comes with a comma attached, where I think "string" alone would be sufficient, because nobody searches for "string," or "code.".
Back to my question: is there a specific rule I could use to define my word tokens so that I avoid this kind of issue with what I have?
Code:
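    // Imports this fragment relies on (PDFBox 2.x package names)
    import java.io.File;
    import java.io.IOException;
    import java.util.HashSet;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    File folder = new File("D:\\PDF1");
    File[] listOfFiles = folder.listFiles(); // null if the folder cannot be read
    for (File file : listOfFiles) {
        if (file.isFile()) {
            // Unique tokens found in this PDF
            HashSet<String> uniqueWords = new HashSet<>();
            String path = "D:\\PDF1\\" + file.getName();
            try (PDDocument document = PDDocument.load(new File(path))) {
                if (!document.isEncrypted()) {
                    PDFTextStripper tStripper = new PDFTextStripper();
                    String pdfFileInText = tStripper.getText(document);
                    // Split the extracted text into lines, then each line on spaces
                    String[] lines = pdfFileInText.split("\\r?\\n");
                    for (String line : lines) {
                        String[] words = line.split(" ");
                        for (String word : words) {
                            uniqueWords.add(word);
                        }
                    }
                }
            } catch (IOException e) {
                System.err.println("Exception while trying to read pdf document - " + e);
            }
        }
    }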
java pdfbox
asked Nov 14 '18 at 1:51 by Daredevil
edited Nov 14 '18 at 2:04 by GBlodgett
Why don't you replace , with ""?
– Scary Wombat
Nov 14 '18 at 1:54
@ScaryWombat What do you mean? Sorry, I'm a bit hazy on this regular expression thing.
– Daredevil
Nov 14 '18 at 1:54
Let's see: a word is a String, and a String has a method replace, so replace "," with "". This is not regex. Then add it to your List.
– Scary Wombat
Nov 14 '18 at 1:56
I see, but wouldn't that conflict with special cases, such as a sentence containing the date 15/12/2018 or f(x) = 2x +3y, where it would be ideal to classify these as 2 words considering they are not separated by spaces?
– Daredevil
Nov 14 '18 at 1:58
The logic is yours; in my example all I am replacing is the comma.
– Scary Wombat
Nov 14 '18 at 1:58
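The String.replace suggestion from the comments above can be tried in isolation. The following is only a minimal sketch: the class name CommaReplaceDemo, the tokens helper, and the sample sentence are made up for illustration and are not part of the asker's code.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical demo class; names and sample text are illustrative only.
final class CommaReplaceDemo {

    // Splits on single spaces (as in the question) and deletes every comma
    // with the literal, non-regex String.replace.
    static Set<String> tokens(String line) {
        Set<String> uniqueWords = new LinkedHashSet<>();
        for (String word : line.split(" ")) {
            String cleaned = word.replace(",", "");
            if (!cleaned.isEmpty()) {   // skip tokens that were only commas
                uniqueWords.add(cleaned);
            }
        }
        return uniqueWords;
    }

    public static void main(String[] args) {
        System.out.println(tokens("This string, is a code."));
        // prints [This, string, is, a, code.]
    }
}
```

Note that only the comma is handled: "code." keeps its period, and a number such as 1,000 would be indexed as 1000, so this covers just the one character it names.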
2 Answers
Yes. You can use the replaceAll method to strip non-word characters from the start and end of each token, like this:
    uniqueWords.add(word.replaceAll("([\\W]+$)|(^[\\W]+)", ""));
answered Nov 14 '18 at 2:14 by Aleksandr Gromov
edited Nov 14 '18 at 4:01
What is \W? I am confused.
– Daredevil
Nov 14 '18 at 2:14
Non-word characters; the W must be a capital one.
– Aleksandr Gromov
Nov 14 '18 at 2:15
But there's going to be a problem: say I have a date 10/12/2018 and I have to include it whole in my index. Then it's going to omit the "/", which I don't want.
– Daredevil
Nov 14 '18 at 2:18
Edited. I added an exclusion; you can add exclusions in this section [^/]. So now it will remove all non-word characters except those which are provided in the [^/] section.
– Aleksandr Gromov
Nov 14 '18 at 2:42
There is a problem. If I have animal. then I would get animal, which is fine. But what if I have 69.4 and I would like it in the same form? It would then omit the dot and become 694.
– Daredevil
Nov 14 '18 at 2:44
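To check the edge cases raised in this thread, here is a small sketch of the anchored form of the pattern, which only touches non-word characters at the start or end of a token. The class name EdgeTrimDemo, the trim helper, and the sample tokens are assumptions made for illustration; the pattern is a simplified equivalent of the one in the answer, not the answerer's exact code.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names and sample tokens are illustrative only.
final class EdgeTrimDemo {

    // Runs of non-word characters anchored at the start or at the end of the token.
    // In Java, \W means anything outside [a-zA-Z0-9_] unless Unicode classes are enabled.
    private static final Pattern EDGES = Pattern.compile("^\\W+|\\W+$");

    static String trim(String token) {
        return EDGES.matcher(token).replaceAll("");
    }

    public static void main(String[] args) {
        String[] samples = {"string,", "code.", "10/12/2018", "69.4", "(animal)", "f(x)"};
        for (String t : samples) {
            System.out.println(t + " -> " + trim(t));
        }
    }
}
```

Because the pattern is anchored, punctuation inside a token is left alone, so 10/12/2018 and 69.4 come through unchanged while "string," and "code." lose their trailing punctuation. The trade-off is that a token such as f(x) loses its closing parenthesis and becomes f(x, and a token made only of punctuation becomes empty, so a real index would probably want to skip empty results.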
If you want to remove common punctuation, you could do:
    for (String word : words) {
        uniqueWords.add(word.replaceAll("[.,!?]", ""));
    }
This removes all periods, commas, exclamation marks, and question marks.
If you also want to get rid of double quotes, you can do:
    uniqueWords.add(word.replaceAll("[.,?!\"]", ""));
answered Nov 14 '18 at 1:54 by GBlodgett
edited Nov 14 '18 at 2:26
What does it do? And what if my sentence contains, say, 11/2/2018 and I would like to keep it whole as one word? It would eliminate it, right?
– Daredevil
Nov 14 '18 at 1:56
Which will replace all periods, commas, exclamation marks, and question marks
– Scary Wombat
Nov 14 '18 at 1:57
@Daredevil No, it will not. Try it for yourself: System.out.println("10/2/18".replaceAll("[.,!?]", ""));
– GBlodgett
Nov 14 '18 at 1:58
Would it be possible to turn something like "animal" (with the quotes) into animal? I tried including the quote character as well, but it wouldn't accept the argument.
– Daredevil
Nov 14 '18 at 2:14
@Daredevil What do you mean? Replace animal with what?
– GBlodgett
Nov 14 '18 at 2:15
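The cases discussed in these comments can be checked with a short sketch of the character-class approach. The class name PunctuationClassDemo, the strip helper, and the sample tokens are made up for illustration. Note that the double quote has to be escaped as \" inside a Java string literal, which may be why the unescaped version "wouldn't accept the argument".

```java
// Hypothetical demo class; names and sample tokens are illustrative only.
final class PunctuationClassDemo {

    // Deletes periods, commas, question marks, exclamation marks and double
    // quotes wherever they occur in the token, not only at its edges.
    static String strip(String token) {
        return token.replaceAll("[.,?!\"]", "");
    }

    public static void main(String[] args) {
        String[] samples = {"string,", "\"animal\"", "10/2/18", "69.4"};
        for (String t : samples) {
            System.out.println(t + " -> " + strip(t));
        }
    }
}
```

With this pattern, 10/2/18 is untouched because the slash is not in the class, and the quoted "animal" becomes animal. However, 69.4 becomes 694, since the period is removed everywhere in the token; that is the price of an unanchored character class.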