How to convert the binary text generated in my .PDF to a string?
I am using this code:
    from PyPDF2 import PdfFileReader

    def text_extractor(path):
        with open(path, 'rb') as f:
            pdf = PdfFileReader(f)
            # get the first page
            page = pdf.getPage(0)
            print(page)
            print('Page type: {}'.format(str(type(page))))
            text = page.extractText()
            print(text)

    if __name__ == '__main__':
        path = 'XEROX.pdf'
        text_extractor(path)
But this returns:
{'/Type': '/Page', '/MediaBox': [0, 0, 612, 792], '/Parent': IndirectObject(3, 0),
'/Resources': {'/ProcSet': ['/PDF', '/ImageB', '/Text'],
'/ExtGState': IndirectObject(47, 0), '/Font': IndirectObject(48, 0)},
'/Contents': IndirectObject(5, 0)}
Page type: <class 'PyPDF2.pdf.PageObject'>
!ˆ"#$
[Finished in 0.9s]
Where is the data?
I think this PDF stores its text as binary symbols instead of ASCII. How can I read this information as ASCII or as a string?
This is the result when I copy and paste the PDF's contents:
python python-3.6 pypdf2
This is basically a duplicate of a question with no accepted answer: stackoverflow.com/questions/34837707/…
– HFBrowning
Nov 9 at 22:53
edited Nov 9 at 23:43
martineau
asked Nov 9 at 22:45
toni
1 Answer
I found it:
I cloned the textract repository from GitHub and installed textract (with some problems, but I got it working), and it works very well. I will edit this answer to include my code.
Regards
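Until that code is posted, here is a minimal sketch of what a textract-based approach might look like. It is not the code from this answer. It assumes textract is installed and that `textract.process` returns bytes, as its documentation describes. The names `looks_garbled`, `extract_text` and `main` are hypothetical, and the 0.5 threshold is an arbitrary guess.

```python
import string


def looks_garbled(text, threshold=0.5):
    """Hypothetical heuristic: report text as garbled when fewer than
    `threshold` of its non-space characters are ASCII letters or digits."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return True
    allowed = string.ascii_letters + string.digits
    readable = sum(c in allowed for c in chars)
    return readable / len(chars) < threshold


def extract_text(path, method="pdfminer"):
    """Extract text from a PDF with textract and return it as a str."""
    import textract  # imported lazily so looks_garbled works without it

    raw = textract.process(path, method=method)  # bytes
    return raw.decode("utf-8", errors="replace")


def main(path):
    text = extract_text(path)
    if looks_garbled(text):
        # Assumption: the page is scanned or its fonts lack a usable
        # character map, so fall back to OCR. This needs the tesseract
        # binary installed on the system.
        text = extract_text(path, method="tesseract")
    print(text)
```

Calling `main('XEROX.pdf')` would try the pdfminer backend first and use OCR only if the result looks like the symbols in the question. Whether OCR is actually needed for this particular file is not something the question settles.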
answered Nov 10 at 20:44
toni