How to convert the binary text generated in my .PDF to a string?

I am using this code:

from PyPDF2 import PdfFileReader


def text_extractor(path):
    with open(path, 'rb') as f:
        pdf = PdfFileReader(f)

        # get the first page
        page = pdf.getPage(0)
        print(page)
        print('Page type: {}'.format(str(type(page))))

        text = page.extractText()
        print(text)


if __name__ == '__main__':
    path = 'XEROX.pdf'
    text_extractor(path)

But this returns:

{'/Type': '/Page', '/MediaBox': [0, 0, 612, 792], '/Parent': IndirectObject(3, 0),
 '/Resources': {'/ProcSet': ['/PDF', '/ImageB', '/Text'],
 '/ExtGState': IndirectObject(47, 0), '/Font': IndirectObject(48, 0)},
 '/Contents': IndirectObject(5, 0)}
Page type: <class 'PyPDF2.pdf.PageObject'>
 !ˆ"#$
[Finished in 0.9s]

Where is the data?

I think this PDF contains binary symbols instead of ASCII. How can I read this information as ASCII or as a string?

This is the PDF's information that I should get

This is the result when I copy and paste the PDF's text:

􀀀􀀀 􀀀



      􀀀  􀀀􀀀

python python-3.6 pypdf2

edited Nov 9 at 23:43

martineau

asked Nov 9 at 22:45

toni

This is basically a duplicate of a question with no accepted answer: stackoverflow.com/questions/34837707/…
– HFBrowning
Nov 9 at 22:53

1 Answer
I found it:

I cloned the textract repository from GitHub. I installed textract (with some problems, but I managed it) and it works very well. I will edit this answer to include my code.

Regards

answered Nov 10 at 20:44

toni
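The answer above never got its promised code. As a rough sketch only, assuming textract is installed together with its PDF backends (pdftotext is the default one), a call might look like the following. The helper names `to_text` and `extract_pdf_text` are made up for illustration and are not from the answer; `textract.process` returns bytes, so the result is decoded before it is printed.

```python
def to_text(raw, encoding="utf-8"):
    """Decode extractor output (bytes) into a str and tidy trailing whitespace.

    Hypothetical helper, not part of textract.
    """
    text = raw.decode(encoding, errors="replace")
    return "\n".join(line.rstrip() for line in text.splitlines()).strip()


def extract_pdf_text(path, method="pdftotext"):
    """Extract text from a PDF with textract and return it as a str.

    Assumption: textract and the chosen backend are installed.
    Other documented methods include "pdfminer" and "tesseract" (OCR).
    """
    import textract  # third-party; imported lazily so to_text works without it

    raw = textract.process(path, method=method)  # returns bytes
    return to_text(raw)


# Possible usage with the file from the question:
# print(extract_pdf_text("XEROX.pdf"))
# print(extract_pdf_text("XEROX.pdf", method="tesseract"))
```

If the PDF's fonts carry no usable character mapping (which the copy-and-paste result in the question hints at), the text-based backends may return the same garbage, and `method="tesseract"` may be the one that helps. That is a guess about this particular file, not something confirmed in the thread.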
