Text Similarity - Cosine - Control












I would like to ask if anybody could check my code, because it was behaving strangely - not working and giving me errors, then suddenly working without my changing anything. The code is at the bottom.



Background: My goal is to calculate text similarity (cosine, for now) between the annual statements given by several countries at the UN General Assembly. More specifically, to find the similarity between statement x and statement y in a given year, and to do this for all 45 years, so I can graph its evolution.



How I went about it: Since I am a novice, I decided to do the work in several steps - first finding the similarity between the statements of country A and country B, and then redoing the work for the other countries (country A stays fixed; everything is compared to country A).



So I filtered the statements for country A and arranged them by year. I did the text preprocessing (tokenization, lowercasing, stopword removal, stemming, bag-of-words) and then built a TF-IDF matrix from it, named text.tokens.tfidf.



I did the same for country B and got text.tokensChina.tfidf - just replacing every text.tokens with text.tokensChina in a new script. So each matrix contains the TF-IDF of annual statements from 1971 - 2005, where rows = documents (years) and columns = terms.
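The TF-IDF weighting used below (relative term frequency times log10 inverse document frequency) can be sanity-checked on a toy count matrix. A minimal Python sketch with made-up numbers, mirroring the R functions further down:

```python
import numpy as np

# Toy document-term count matrix: 2 documents (rows) x 2 terms (columns)
counts = np.array([[2.0, 0.0],
                   [1.0, 1.0]])

# Relative term frequency: each row divided by its row sum
tf = counts / counts.sum(axis=1, keepdims=True)

# Inverse document frequency: log10(corpus size / number of docs containing the term)
idf = np.log10(counts.shape[0] / (counts > 0).sum(axis=0))

# TF-IDF: term frequency scaled by IDF (rows = documents, columns = terms)
tfidf = tf * idf
```

A term that appears in every document gets IDF 0 and therefore TF-IDF 0, which is the intended behavior of this weighting scheme.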



Calculating cosine similarity: I decided to use text2vec as described here - however, I did not define a common space and project the documents into it; I don't know if that is crucial. I then decided to test two functions, sim2 and psim2, since I did not know the difference between them.



What was wrong at the start: When first running the functions, I was getting an error, apparently telling me that the numbers of columns in the two TF-IDF matrices did not match:




ncol(x) == ncol(y) is not TRUE




However, after re-running the code for all my steps and trying again, it worked - even though I did not change anything ...



Results: The result of sim2 is a strange table [1:45, 1:45]. Clearly not what I wanted - I wanted one column with the similarity between the speeches of country A and country B in each year.



The result of psim2 is better - one column with the results (not sure how correct they are, though).



Technical questions: psim2 is what I want - now I see that sim2 created something like a correlation heat map, my bad. But why does the psim2 function work even when the numbers of columns differ (picture)? Also, did I do anything wrong, especially since I did not create a common space?
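For what "common space" means here, a minimal sketch in Python with toy data (illustrative only, not the actual R pipeline): both documents have to be vectorized against one shared vocabulary before cosine similarity between their vectors is meaningful.

```python
import numpy as np

def cosine(u, v):
    # cosine similarity of two equal-length vectors
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

doc_a = "peace and security council".split()
doc_b = "peace treaty and security".split()

# Common space: one shared vocabulary built from BOTH documents,
# so both vectors index the same terms in the same order
vocab = sorted(set(doc_a) | set(doc_b))
vec_a = np.array([doc_a.count(t) for t in vocab], dtype=float)
vec_b = np.array([doc_b.count(t) for t in vocab], dtype=float)

print(cosine(vec_a, vec_b))  # 0.75
```

Building each matrix from its own corpus, as in the question, produces different vocabularies (different columns), which is exactly what the `ncol(x) == ncol(y)` check complains about.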



Code, picture:



    # *** Text Pre-Processing with Quanteda ***
    library(quanteda)
    library(text2vec)

    # 1. Tokenization
    text.tokens <- tokens(docs$text, what = 'word',
                          remove_numbers = TRUE,
                          remove_punct = TRUE,
                          remove_symbols = TRUE,
                          remove_hyphens = TRUE)

    # 2. Transform words to lower case
    text.tokens <- tokens_tolower(text.tokens)

    # 3. Remove stop-words (using quanteda's built-in stopwords list)
    text.tokens <- tokens_select(text.tokens, stopwords(),
                                 selection = 'remove')

    # 4. Perform stemming on the tokens
    text.tokens <- tokens_wordstem(text.tokens, language = 'english')

    # 5. Create the bag-of-words model / document-feature (frequency) matrix
    text.tokens.dfm <- dfm(text.tokens, tolower = FALSE)

    # 6. Transform to a matrix to work with and inspect
    text.tokens.matrix <- as.matrix(text.tokens.dfm)
    dim(text.tokens.matrix)

    # *** Doing TF-IDF ***
    # Function for calculating relative term frequency (TF)
    term.frequency <- function(row) {
      row / sum(row)
    }

    # Function for calculating inverse document frequency (IDF)
    inverse.doc.freq <- function(col) {
      corpus.size <- length(col)
      doc.count <- length(which(col > 0))
      log10(corpus.size / doc.count)
    }

    # Function for calculating TF-IDF
    tf.idf <- function(tf, idf) {
      tf * idf
    }

    # 1. First step: normalize all documents via TF
    text.tokens.df <- apply(text.tokens.matrix, 1, term.frequency)
    dim(text.tokens.df)

    # 2. Second step: calculate the IDF vector
    text.tokens.idf <- apply(text.tokens.matrix, 2, inverse.doc.freq)
    str(text.tokens.idf)

    # 3. Lastly, calculate TF-IDF for the corpus
    # Apply over columns, because the matrix is transposed by the TF step
    text.tokens.tfidf <- apply(text.tokens.df, 2, tf.idf, idf = text.tokens.idf)
    dim(text.tokens.tfidf)

    # Transpose the matrix back
    text.tokens.tfidf <- t(text.tokens.tfidf)
    dim(text.tokens.tfidf)

    # Cosine similarity using text2vec
    similarity.sim2 <- sim2(text.tokensChina.tfidf, text.tokensChina.tfidf,
                            method = "cosine", norm = "none")

    similarity.psim2 <- psim2(text.tokensChina.tfidf, text.tokensChina.tfidf,
                              method = "cosine", norm = "none")
    similarity.psim2 <- as.data.frame(similarity.psim2)
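For the sim2/psim2 distinction, a rough numpy analogue (a sketch of the behavior, not text2vec's implementation): sim2 returns the full matrix of all pairwise cosines between rows of x and rows of y, while psim2 compares row i of x with row i of y and returns one value per row (here, per year).

```python
import numpy as np

def sim2_like(x, y):
    # all-pairs cosine: returns an n_x by n_y matrix (sim2-style output)
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    yn = y / np.linalg.norm(y, axis=1, keepdims=True)
    return xn @ yn.T

def psim2_like(x, y):
    # parallel (row-by-row) cosine: row i of x vs row i of y (psim2-style output);
    # the element-wise product requires the two matrices to have identical shapes
    num = np.sum(x * y, axis=1)
    den = np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1)
    return num / den

# Toy data: 3 "years" of 2-term vectors for each "country"
x = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]])

print(sim2_like(x, y).shape)  # (3, 3) - the "heat map" table
print(psim2_like(x, y))       # one similarity per year
```

So the [1:45, 1:45] table from sim2 and the single column from psim2 are both the documented shapes for those functions; only psim2 gives the one-value-per-year series the question asks for.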


Global Environment picture:
Picture of my screen with Global Environment + Psim2 Results










  • This is not really the kind of question for Stack Overflow. You might do better to post this on Code Review

    – G5W
    Nov 15 '18 at 19:16











  • Oh thank you, was not aware of the code review :/

    – Kamil Liskutin
    Nov 15 '18 at 19:34











  • See quanteda::dfm_tfidf().

    – Ken Benoit
    Nov 15 '18 at 20:02











  • Yeah, I know you can skip most of the preprocessing code; I just wanted to have it step by step because I am new to this, so I remember it. Afterward I will be using that.

    – Kamil Liskutin
    Nov 15 '18 at 20:40
















r cosine-similarity linguistics quanteda text2vec






asked Nov 15 '18 at 18:43









Kamil Liskutin
13




1 Answer






Well, the outcome is that the whole thing is complete BS: I did not compare things in one vector space. Not to mention that the best method would be to use doc2vec, but I tried to figure that out for several days and got nowhere, unfortunately.






        answered Nov 18 '18 at 1:01









Kamil Liskutin
13



