The Document-Term Matrix (DTM) is the foundation of computational text analysis, and as a result there are several R packages that provide a means to build one.

What is a DTM?

It is a matrix with rows and columns, where the rows are the documents in some sample of texts (called a corpus) and the columns are all the unique words in the corpus (often called types or, collectively, the vocabulary). The cells of the matrix are typically a count of how many times each unique word occurs in a given document (each occurrence of a word is often called a token).

Below, I attempt a comprehensive overview and comparison of 15 different methods for creating a DTM. Two are custom functions written in base R. The rest are from eleven text analysis packages. One of these is an R package, text2map, that I developed with Marshall Taylor. (The dtm_builder() function was developed in tandem with writing the original comparison back in December 2020.)

Below are the non-text-analysis packages we'll be using. And here are the text analysis R packages we'll be using. These include every package I could find that provides a function for creating a DTM (the koRpus package does provide a document-term matrix method, but I could not get it to work). Feel free to let me know if I've missed a package that creates a DTM.

We will limit the scripts to TNG and collapse the lines into documents:

```r
filter(series == "TNG") |> # limit to TNG
  summarize(text = paste0(line, collapse = " "))
```

Let's do a tiny bit of preprocessing: lowercasing, smooshing contractions, and removing punctuation, numbers, and extra spaces.

```r
mutate(
  clean_text = tolower(text),
  clean_text = gsub("'", "", clean_text),
  clean_text = gsub("[[:punct:]]+", " ", clean_text),
  clean_text = gsub("[[:digit:]]+", " ", clean_text),
  clean_text = gsub("[[:space:]]+", " ", clean_text)
)
```

To get started, let's create two base R methods for creating dense DTMs. There are three necessary steps: (1) tokenize, (2) create a vocabulary, and (3) match and count. First, each document is split into a list of individual tokens. Second, from these lists of tokens, we extract only the unique tokens to create a vocabulary. Finally, we count each time a token in a document matches a token in the vocabulary.

While the above are essential, there are a few optional steps which functions may or may not take by default. First, the most basic DTM uses the raw counts of each word in a document. Some functions include the option to weight the matrix; the most common weighting is to normalize by the row count to get relative frequencies. Since all weightings require raw counts anyway, we will just stop at a count DTM (not to mention that relative frequencies turn an integer matrix into a real-number matrix, which results in a larger object in terms of memory). Second, the columns of our DTM may be sorted by (1) the order they appear in the corpus, (2) alphabetical order, or (3) their frequency in the corpus. The first option will be the fastest and the third the slowest. The function may also incorporate the removal, or "stopping," of certain tokens. It is more efficient to build the DTM first, and then simply remove the columns that match a given stoplist. Finally, we can tokenize using a variety of rules. Both methods below will use strsplit() with a literal, single space (fixed = TRUE significantly speeds up this process). This also means they are very fast in comparison to methods with more complex rules. For example, we could tokenize by every two-word bi-gram or, instead of a literal single space, split on other kinds of whitespace (tabs, carriage returns, newlines, etc.). Both methods below will also use unique() for getting the unique tokens (i.e., the vocabulary).

Labeling the two methods in the benchmark results:

```r
mutate(expression = c("loop", "lapply")) |>
```

Similar, but our for loop function beat the lapply version. It's possible that with a larger corpus, lapply will show more substantial gains over the for loop method.

Sparse DTMs

As a result of the nature of language, DTMs tend to be very "sparse," meaning they have lots and lots of zeros. For example, let's see how many cells are zeros in one of the dense DTMs we just created based on the Star Trek: TNG scripts.

The tidytext method tokenizes with unnest_tokens() and counts with count():

```r
tidytext::unnest_tokens(word, clean_text) |>
  dplyr::count(doc_id, word, sort = TRUE) |>
```

Labeling the methods in the comparison:

```r
mutate(expression = c("base (loop)", "base (lapply)", "text2map",
```
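The three necessary steps described in this post (tokenize, create a vocabulary, then match and count) can be sketched in a few lines of base R. This is my own illustrative function with toy documents, not the code benchmarked in the comparison:

```r
# Minimal dense DTM builder in base R: tokenize, build vocabulary, match & count
build_dtm <- function(docs) {
  # (1) tokenize: split each document on a literal single space
  #     (fixed = TRUE skips the regex engine, which speeds this up)
  tokens <- strsplit(docs, " ", fixed = TRUE)
  # (2) vocabulary: the unique tokens, in order of first appearance
  vocab <- unique(unlist(tokens))
  # (3) match and count: tabulate each document's tokens against the vocabulary
  dtm <- matrix(0L, nrow = length(docs), ncol = length(vocab),
                dimnames = list(names(docs), vocab))
  for (i in seq_along(tokens)) {
    counts <- table(factor(tokens[[i]], levels = vocab))
    dtm[i, ] <- as.integer(counts)
  }
  dtm
}

docs <- c(d1 = "to boldly go", d2 = "go boldly go")
dtm <- build_dtm(docs)
dtm["d2", "go"]  # 2
```

An lapply() variant would replace the for loop with lapply() over the token lists and bind the resulting rows together; either way, the match-and-count step dominates the runtime.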
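To see what the tokenization rules discussed in this post actually do, here is a small comparison of a literal single-space split against a whitespace regex; the example string is made up for illustration:

```r
x <- "make\tit  so"

# Literal single-space split (fixed = TRUE avoids the regex engine entirely)
strsplit(x, " ", fixed = TRUE)[[1]]
# "make\tit" ""  "so"   <- the tab and the doubled space are not handled

# Regex split on any run of whitespace (tabs, newlines, repeated spaces, etc.)
strsplit(x, "[[:space:]]+")[[1]]
# "make" "it" "so"
```

The literal split is faster precisely because it ignores these edge cases, which is why it pays to squash extra whitespace during preprocessing.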
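And to make sparsity concrete, counting zero cells in a dense matrix is a one-liner in base R; the tiny matrix below is invented for illustration, not one of the TNG DTMs:

```r
# A tiny dense "DTM": 3 documents by 4 terms, mostly zeros
dtm <- matrix(c(2L, 0L, 0L, 1L,
                0L, 3L, 0L, 0L,
                0L, 0L, 1L, 0L),
              nrow = 3, byrow = TRUE)

sum(dtm == 0)   # number of zero cells: 8
mean(dtm == 0)  # proportion of zero cells: 8/12, about 0.67
```

Real DTMs are routinely well over 90% zeros, which is why sparse matrix formats store only the nonzero entries.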