How to convert the binary text generated in my .PDF to a string?

I am using this code:



from PyPDF2 import PdfFileReader


def text_extractor(path):
    # the file must stay open while the pages are read
    with open(path, 'rb') as f:
        pdf = PdfFileReader(f)

        # get the first page
        page = pdf.getPage(0)
        print(page)
        print('Page type: {}'.format(str(type(page))))

        # extract and print the text of that page
        text = page.extractText()
        print(text)


if __name__ == '__main__':
    path = 'XEROX.pdf'
    text_extractor(path)


But this returns:



{'/Type': '/Page', '/MediaBox': [0, 0, 612, 792], '/Parent': IndirectObject(3, 0),
'/Resources': {'/ProcSet': ['/PDF', '/ImageB', '/Text'],
'/ExtGState': IndirectObject(47, 0), '/Font': IndirectObject(48, 0)},
'/Contents': IndirectObject(5, 0)}
Page type: <class 'PyPDF2.pdf.PageObject'>
!ˆ"#$
[Finished in 0.9s]


Where is the data?



I think this PDF contains binary symbols instead of ASCII. How can I read this information as ASCII text or as a string?



This is the information in the PDF that I should get



This is the result when I copy and paste the PDF's text:



􀀀􀀀 􀀀

􀀀 􀀀􀀀









  • This is basically a duplicate of a question with no accepted answer: stackoverflow.com/questions/34837707/…
    – HFBrowning
    Nov 9 at 22:53

python python-3.6 pypdf2

edited Nov 9 at 23:43 by martineau
asked Nov 9 at 22:45 by toni

1 Answer

I found it:



I cloned the textract repository from GitHub and installed textract (with some problems, but I got it working), and it works very well. I will edit this answer to include my code.
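No code was posted with this answer, so the following is only a minimal sketch of what using textract could look like, not the original code. It assumes textract is installed together with the external tool the chosen method needs (for example pdftotext, or tesseract for OCR). The names `extract_pdf_text` and `readable_ratio` are hypothetical helpers introduced for illustration; the second one is just a rough heuristic for spotting output like the `!ˆ"#$` shown in the question.

```python
def extract_pdf_text(path, method="pdftotext"):
    """Return the text of a PDF as a str, using textract.

    Assumption: textract and the external tool behind `method`
    (pdftotext, pdfminer or tesseract) are installed.
    """
    # Third-party import, done lazily so readable_ratio() works without it.
    import textract

    raw = textract.process(path, method=method)  # textract returns bytes
    return raw.decode("utf-8", errors="replace")


def readable_ratio(text):
    """Fraction of non-whitespace characters that are letters or digits.

    Hypothetical heuristic: a low value suggests the extraction produced
    symbols rather than readable text.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isalnum() for c in chars) / len(chars)
```

One possible use: call `extract_pdf_text('XEROX.pdf')`, and if `readable_ratio` of the result is low (say below 0.5, an arbitrary threshold), retry with `method='tesseract'`, which runs OCR on the pages instead of reading the embedded text.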



Regards






answered Nov 10 at 20:44 by toni