Package 'tosca'

Title:	Tools for Statistical Content Analysis
Description:	A framework for statistical analysis in content analysis. In addition to a pipeline for preprocessing text corpora and linking to the latent Dirichlet allocation from the 'lda' package, plots are offered for the descriptive analysis of text corpora and topic models. In addition, an implementation of Chang's intruder words and intruder topics is provided. Sample data for the vignette is included in the toscaData package, which is available on gitHub: <https://github.com/Docma-TU/toscaData>.
Authors:	Lars Koppers [aut, cre] , Jonas Rieger [aut] , Karin Boczek [ctb] , Gerret von Nordheim [ctb]
Maintainer:	Lars Koppers <[email protected]>
License:	GPL (>= 2)
Version:	0.3-1
Built:	2025-03-12 05:33:37 UTC
Source:	https://github.com/docma-tu/tosca

Help Index

Transform textmeta to corpus
"meta" Component of "textmeta"-Objects
Transform corpus to textmeta
Data Preprocessing
Cluster Analysis
Deletes and Renames Articles with the same ID
Creating List of Duplicates
Subcorpus With Count Filter
Subcorpus With Date Filter
Subcorpus With ID Filter
Subcorpus With Word Filter
Function to validate the fit of the LDA model
Function to validate the fit of the LDA model
Function to fit LDA model
Create Lda-ready Dataset
Counts Words in Text Corpora
Preparation of Different LDAs For Clustering
Merge Textmeta Objects
Plotting topics over time as stacked areas below plotted lines.
Plotting Counts of specified Wordgroups over Time (relative to Corpus)
Plotting Topics over Time relative to Corpus
Plots Counts of Documents or Words over Time (relative to Corpus)
Plotting Counts of Topics over Time (Relative to Corpus)
Plotting Counts of Topics-Words-Combination over Time (Relative to Words)
Plots Counts of Topics-Words-Combination over Time (Relative to Topics)
Plotting Counts/Proportion of Words/Docs in LDA-generated Topic-Subcorpora over Time
Precision and Recall
Read Corpora as CSV
Read WhatsApp files
Read Pages from Wikipedia
Read files from Wikinews
Removes XML/HTML Tags and Umlauts
Sample Texts
Export Readable Meta-Data of Articles.
Exports Readable Text Lists
"textmeta"-Objects
Transform textmeta to an object with tidy text data
Calculating Topic Coherence
Coloring the words of a text corresponding to topic allocation
Get The IDs Of The Most Representive Texts
Top Words per Topic

Transform textmeta to corpus

Description

Transfers data from a textmeta object to a corpus object - the way text data is stored in the package quanteda.

Usage

as.corpus.textmeta(
  object,
  docnames = "id",
  docvars = setdiff(colnames(object$meta), "id"),
  ...
)
as.corpus.textmeta(
  object,
  docnames = "id",
  docvars = setdiff(colnames(object$meta), "id"),
  ...
)

Arguments

`object`	`textmeta` object
`docnames`	Character: string with the column of `object$meta` which should be kept as `docnames`.
`docvars`	Character: vector with columns of `object$meta` which should be kept as `docvars`.
`...`	Additional parameters like `meta` or `compress` for `corpus`.

Value

corpus object

Examples

texts <- list(A="Give a Man a Fish, and You Feed Him for a Day.
 Teach a Man To Fish, and You Feed Him for a Lifetime",
 B="So Long, and Thanks for All the Fish",
 C="A very able manipulative mathematician, Fisher enjoys a real mastery
 in evaluating complicated multiple integrals.")

obj <- textmeta(meta=data.frame(id=c("A", "B", "C", "D"),
 title=c("Fishing", "Don't panic!", "Sir Ronald", "Berlin"),
 date=c("1885-01-02", "1979-03-04", "1951-05-06", "1967-06-02"),
 additionalVariable=1:4, stringsAsFactors=FALSE), text=texts)

corp <- as.corpus.textmeta(obj)
quanteda::docvars(corp)
#quanteda::textstat_summary(corp)
texts <- list(A="Give a Man a Fish, and You Feed Him for a Day.
 Teach a Man To Fish, and You Feed Him for a Lifetime",
 B="So Long, and Thanks for All the Fish",
 C="A very able manipulative mathematician, Fisher enjoys a real mastery
 in evaluating complicated multiple integrals.")

obj <- textmeta(meta=data.frame(id=c("A", "B", "C", "D"),
 title=c("Fishing", "Don't panic!", "Sir Ronald", "Berlin"),
 date=c("1885-01-02", "1979-03-04", "1951-05-06", "1967-06-02"),
 additionalVariable=1:4, stringsAsFactors=FALSE), text=texts)

corp <- as.corpus.textmeta(obj)
quanteda::docvars(corp)
#quanteda::textstat_summary(corp)

"meta" Component of "textmeta"-Objects

Description

Helper to create the requested data.frame to create a "textmeta" object.

Usage

as.meta(
  x,
  cols = colnames(x),
  idCol = "id",
  dateCol = "date",
  titleCol = "title",
  dateFormat
)
as.meta(
  x,
  cols = colnames(x),
  idCol = "id",
  dateCol = "date",
  titleCol = "title",
  dateFormat
)

Arguments

`x`	data.frame to convert
`cols`	`character` vector with columns which should be kept
`idCol`	`character` string with column name of the IDs
`dateCol`	`character` string with column name of the Dates
`titleCol`	`character` string with column name of the Titles
`dateFormat`	`character` string with the date format in x for `as.Date`. If not supplied, dates are not transformed.

Value

A data.frame with columns "id", "date", "title" and user-specified others.

Examples

meta <- data.frame(id = 1:3, additionalVariable = matrix(5, ncol = 4, nrow = 3))
(as.meta(meta))

meta <- data.frame(id = 1:3, additionalVariable = matrix(5, ncol = 4, nrow = 3))
(as.meta(meta))

Transform corpus to textmeta

Description

Transfers data from a corpus object - the way text data is stored in the package quanteda - to a textmeta object.

Usage

as.textmeta.corpus(
  corpus,
  cols,
  dateFormat = "%Y-%m-%d",
  idCol = "id",
  dateCol = "date",
  titleCol = "title",
  textCol = "texts",
  duplicateAction = TRUE,
  addMetadata = TRUE
)
as.textmeta.corpus(
  corpus,
  cols,
  dateFormat = "%Y-%m-%d",
  idCol = "id",
  dateCol = "date",
  titleCol = "title",
  textCol = "texts",
  duplicateAction = TRUE,
  addMetadata = TRUE
)

Arguments

`corpus`	Object of class `corpus`, package `quanteda`.
`cols`	Character: vector with columns which should be kept.
`dateFormat`	Character: string with the date format in the date column for `as.Date`.
`idCol`	Character: string with column name of the IDs in corpus - named "id" in the resulting data.frame.
`dateCol`	Character: string with column name of the Dates in corpus - named "date" in the resulting data.frame.
`titleCol`	Character: string with column name of the Titles in corpus - named "title" in the resulting data.frame.
`textCol`	Character: string with column name of the Texts in corpus - results in a named list ("id") of the Texts.
`duplicateAction`	Logical: Should `deleteAndRenameDuplicates` be applied to the created `textmeta` object?
`addMetadata`	Logical: Should the metadata flag of corpus be added to the meta flag of the `textmeta` object? If there are conflicts regarding the naming of columns, the metadata columns would be overwritten by the document specific columns.

Value

textmeta object

Examples

texts <- c("Give a Man a Fish, and You Feed Him for a Day.
 Teach a Man To Fish, and You Feed Him for a Lifetime",
 "So Long, and Thanks for All the Fish",
 "A very able manipulative mathematician, Fisher enjoys a real mastery
 in evaluating complicated multiple integrals.")

corp <- quanteda::corpus(x = texts)
obj <- as.textmeta.corpus(corp, addMetadata = FALSE)

quanteda::docvars(corp, "title") <- c("Fishing", "Don't panic!", "Sir Ronald")
quanteda::docvars(corp, "date") <- c("1885-01-02", "1979-03-04", "1951-05-06")
quanteda::docvars(corp, "id") <- c("A", "B", "C")
quanteda::docvars(corp, "additionalVariable") <- 1:3

obj <- as.textmeta.corpus(corp)
texts <- c("Give a Man a Fish, and You Feed Him for a Day.
 Teach a Man To Fish, and You Feed Him for a Lifetime",
 "So Long, and Thanks for All the Fish",
 "A very able manipulative mathematician, Fisher enjoys a real mastery
 in evaluating complicated multiple integrals.")

corp <- quanteda::corpus(x = texts)
obj <- as.textmeta.corpus(corp, addMetadata = FALSE)

quanteda::docvars(corp, "title") <- c("Fishing", "Don't panic!", "Sir Ronald")
quanteda::docvars(corp, "date") <- c("1885-01-02", "1979-03-04", "1951-05-06")
quanteda::docvars(corp, "id") <- c("A", "B", "C")
quanteda::docvars(corp, "additionalVariable") <- 1:3

obj <- as.textmeta.corpus(corp)

Data Preprocessing

Description

Removes punctuation, numbers and stopwords, changes letters into lowercase and tokenizes.

Usage

cleanTexts(
  object,
  text,
  sw = "en",
  paragraph = FALSE,
  lowercase = TRUE,
  rmPunctuation = TRUE,
  rmNumbers = TRUE,
  checkUTF8 = TRUE,
  ucp = TRUE
)
cleanTexts(
  object,
  text,
  sw = "en",
  paragraph = FALSE,
  lowercase = TRUE,
  rmPunctuation = TRUE,
  rmNumbers = TRUE,
  checkUTF8 = TRUE,
  ucp = TRUE
)

Arguments

`object`	`textmeta` object
`text`	Not necassary if `object` is specified, else should be `object\$text`: List of article texts.
`sw`	Character: Vector of stopwords. If the vector is of length one, `sw` is interpreted as argument for `stopwords` from the tm package.
`paragraph`	Logical: Should be set to `TRUE` if one article is a list of character strings, representing the paragraphs.
`lowercase`	Logical: Should be set to `TRUE` if all letters should be coerced to lowercase.
`rmPunctuation`	Logical: Should be set to `TRUE` if punctuation should be removed from articles.
`rmNumbers`	Logical: Should be set to `TRUE` if numbers should be removed from articles.
`checkUTF8`	Logical: Should be set to `TRUE` if articles should be tested on UTF-8 - which is package standard.
`ucp`	Logical: ucp option for `removePunctuation` from the tm package. Runs remove punctuation twice (ASCII and Unicode).

Details

Removes punctuation, numbers and stopwords, change into lowercase letters and tokenization. Additional some cleaning steps: remove empty words / paragraphs / article.

Value

A textmeta object or a list (if object is not specified) containing the preprocessed articles.

Examples

texts <- list(A="Give a Man a Fish, and You Feed Him for a Day.
Teach a Man To Fish, and You Feed Him for a Lifetime",
B="So Long, and Thanks for All the Fish",
C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.")

corpus <- textmeta(meta=data.frame(id=c("A", "B", "C", "D"),
title=c("Fishing", "Don't panic!", "Sir Ronald", "Berlin"),
date=c("1885-01-02", "1979-03-04", "1951-05-06", "1967-06-02"),
additionalVariable=1:4, stringsAsFactors=FALSE), text=texts)

cleanTexts(object=corpus)

texts <- list(A=c("Give a Man a Fish, and You Feed Him for a Day.",
"Teach a Man To Fish, and You Feed Him for a Lifetime"),
B="So Long, and Thanks for All the Fish",
C=c("A very able manipulative mathematician,",
"Fisher enjoys a real mastery in evaluating complicated multiple integrals."))

cleanTexts(text=texts, sw = "en", paragraph = TRUE)

texts <- list(A="Give a Man a Fish, and You Feed Him for a Day.
Teach a Man To Fish, and You Feed Him for a Lifetime",
B="So Long, and Thanks for All the Fish",
C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.")

corpus <- textmeta(meta=data.frame(id=c("A", "B", "C", "D"),
title=c("Fishing", "Don't panic!", "Sir Ronald", "Berlin"),
date=c("1885-01-02", "1979-03-04", "1951-05-06", "1967-06-02"),
additionalVariable=1:4, stringsAsFactors=FALSE), text=texts)

cleanTexts(object=corpus)

texts <- list(A=c("Give a Man a Fish, and You Feed Him for a Day.",
"Teach a Man To Fish, and You Feed Him for a Lifetime"),
B="So Long, and Thanks for All the Fish",
C=c("A very able manipulative mathematician,",
"Fisher enjoys a real mastery in evaluating complicated multiple integrals."))

cleanTexts(text=texts, sw = "en", paragraph = TRUE)

Cluster Analysis

Description

This function makes a cluster analysis using the Hellinger distance.

Usage

clusterTopics(
  ldaresult,
  file,
  tnames = NULL,
  method = "average",
  width = 30,
  height = 15,
  ...
)
clusterTopics(
  ldaresult,
  file,
  tnames = NULL,
  method = "average",
  width = 30,
  height = 15,
  ...
)

Arguments

`ldaresult`	The result of a function call `LDAgen` - alternatively the corresponding matrix `result$topics`
`file`	File for the dendogram pdf.
`tnames`	Character vector as label for the topics.
`method`	Method statement from `hclust`
`width`	Grafical parameter for pdf output. See `pdf`
`height`	Grafical parameter for pdf output. See `pdf`
`...`	Additional parameter for `plot`

Details

This function is useful to analyze topic similarities and while evaluating the right number of topics of LDAs.

Value

A dendogram as pdf and a list containing

`dist`	A distance matrix
`clust`	The result from `hclust`

Examples


texts <- list(A="Give a Man a Fish, and You Feed Him for a Day.
Teach a Man To Fish, and You Feed Him for a Lifetime",
B="So Long, and Thanks for All the Fish",
C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.")

corpus <- textmeta(meta=data.frame(id=c("A", "B", "C", "D"),
title=c("Fishing", "Don't panic!", "Sir Ronald", "Berlin"),
date=c("1885-01-02", "1979-03-04", "1951-05-06", "1967-06-02"),
additionalVariable=1:4, stringsAsFactors=FALSE), text=texts)

corpus <- cleanTexts(corpus)
wordlist <- makeWordlist(corpus$text)
ldaPrep <- LDAprep(text=corpus$text, vocab=wordlist$words)

LDA <- LDAgen(documents=ldaPrep, K = 3L, vocab=wordlist$words, num.words=3)
clusterTopics(ldaresult=LDA)

texts <- list(A="Give a Man a Fish, and You Feed Him for a Day.
Teach a Man To Fish, and You Feed Him for a Lifetime",
B="So Long, and Thanks for All the Fish",
C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.")

corpus <- textmeta(meta=data.frame(id=c("A", "B", "C", "D"),
title=c("Fishing", "Don't panic!", "Sir Ronald", "Berlin"),
date=c("1885-01-02", "1979-03-04", "1951-05-06", "1967-06-02"),
additionalVariable=1:4, stringsAsFactors=FALSE), text=texts)

corpus <- cleanTexts(corpus)
wordlist <- makeWordlist(corpus$text)
ldaPrep <- LDAprep(text=corpus$text, vocab=wordlist$words)

LDA <- LDAgen(documents=ldaPrep, K = 3L, vocab=wordlist$words, num.words=3)
clusterTopics(ldaresult=LDA)

Deletes and Renames Articles with the same ID

Description

Deletes articles with the same ID and same text. Renames the ID of articles with the same ID but different text-component (_IDFakeDup, _IDRealDup).

Usage

deleteAndRenameDuplicates(object, renameRemaining = TRUE)
deleteAndRenameDuplicates(object, renameRemaining = TRUE)

Arguments

`object`	A `textmeta` object as a result of a read-function.
`renameRemaining`	Logical: Should all articles for which a counterpart with the same id exists, but which do not have the same text and - in addition - which matches (an)other article(s) in the text field be named a "fake duplicate" or not.

Details

Summary: Different types of duplicates: "complete duplicates" = same ID, same information in text, same information in meta "real duplicates" = same ID, same information in text, different information in meta "fake duplicates" = same ID, different information in text

Value

A filtered textmeta object with updated IDs.

Examples

texts <- list(A="Give a Man a Fish, and You Feed Him for a Day.
Teach a Man To Fish, and You Feed Him for a Lifetime",
A="A fake duplicate",
B="So Long, and Thanks for All the Fish",
B="So Long, and Thanks for All the Fish",
C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.",
C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.")

corpus <- textmeta(meta=data.frame(id=c("A", "A", "B", "B", "C", "C"),
title=c("Fishing", "Fake duplicate", "Don't panic!", "towel day", "Sir Ronald", "Sir Ronald"),
date=c("1885-01-02", "1885-01-03", "1979-03-04", "1979-03-05", "1951-05-06", "1951-05-06"),
stringsAsFactors=FALSE), text=texts)

duplicates <- deleteAndRenameDuplicates(object=corpus)
duplicates$meta$id

texts <- list(A="Give a Man a Fish, and You Feed Him for a Day.
Teach a Man To Fish, and You Feed Him for a Lifetime",
A="Give a Man a Fish, and You Feed Him for a Day.
Teach a Man To Fish, and You Feed Him for a Lifetime",
A="A fake duplicate",
B="So Long, and Thanks for All the Fish",
B="So Long, and Thanks for All the Fish",
C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.",
C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.")

corpus <- textmeta(meta=data.frame(id=c("A", "A", "A", "B", "B", "C", "C"),
title=c("Fishing", "Fishing2", "Fake duplicate", "Don't panic!", "towel day",
"Sir Ronald", "Sir Ronald"),
date=c("1885-01-02", "1885-01-02", "1885-01-03", "1979-03-04", "1979-03-05",
"1951-05-06", "1951-05-06"),
stringsAsFactors=FALSE), text=texts)

duplicates <- deleteAndRenameDuplicates(object=corpus)
duplicates2 <- deleteAndRenameDuplicates(object=corpus, renameRemaining = FALSE)

texts <- list(A="Give a Man a Fish, and You Feed Him for a Day.
Teach a Man To Fish, and You Feed Him for a Lifetime",
A="A fake duplicate",
B="So Long, and Thanks for All the Fish",
B="So Long, and Thanks for All the Fish",
C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.",
C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.")

corpus <- textmeta(meta=data.frame(id=c("A", "A", "B", "B", "C", "C"),
title=c("Fishing", "Fake duplicate", "Don't panic!", "towel day", "Sir Ronald", "Sir Ronald"),
date=c("1885-01-02", "1885-01-03", "1979-03-04", "1979-03-05", "1951-05-06", "1951-05-06"),
stringsAsFactors=FALSE), text=texts)

duplicates <- deleteAndRenameDuplicates(object=corpus)
duplicates$meta$id

texts <- list(A="Give a Man a Fish, and You Feed Him for a Day.
Teach a Man To Fish, and You Feed Him for a Lifetime",
A="Give a Man a Fish, and You Feed Him for a Day.
Teach a Man To Fish, and You Feed Him for a Lifetime",
A="A fake duplicate",
B="So Long, and Thanks for All the Fish",
B="So Long, and Thanks for All the Fish",
C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.",
C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.")

corpus <- textmeta(meta=data.frame(id=c("A", "A", "A", "B", "B", "C", "C"),
title=c("Fishing", "Fishing2", "Fake duplicate", "Don't panic!", "towel day",
"Sir Ronald", "Sir Ronald"),
date=c("1885-01-02", "1885-01-02", "1885-01-03", "1979-03-04", "1979-03-05",
"1951-05-06", "1951-05-06"),
stringsAsFactors=FALSE), text=texts)

duplicates <- deleteAndRenameDuplicates(object=corpus)
duplicates2 <- deleteAndRenameDuplicates(object=corpus, renameRemaining = FALSE)

Creating List of Duplicates

Description

Creates a List of different types of Duplicates in a textmeta-object.

Usage

duplist(object, paragraph = FALSE)

is.duplist(x)

## S3 method for class 'duplist'
print(x, ...)

## S3 method for class 'duplist'
summary(object, ...)
duplist(object, paragraph = FALSE)

is.duplist(x)

## S3 method for class 'duplist'
print(x, ...)

## S3 method for class 'duplist'
summary(object, ...)

Arguments

`object`	A textmeta-object.
`paragraph`	Logical: Should be set to `TRUE` if the article is a list of character strings, representing the paragraphs.
`x`	An R Object.
`...`	Further arguments for print and summary. Not implemented.

Details

This function helps to identify different types of Duplicates and gives the ability to exclude these for further Analysis (e.g. LDA).

Value

Named List:

`uniqueTexts`	Character vector of IDs so that each text occurs once - if a text occurs twice or more often in the corpus, the ID of the first text regarding the list-order is returned
`notDuplicatedTexts`	Character vector of IDs of texts which are represented only once in the whole corpus
`idFakeDups`	List of character vectors: IDs of texts which originally has the same ID but belongs to different texts grouped by their original ID
`idRealDups`	List of character vectors: IDs of texts which originally has the same ID and text but different meta information grouped by their original ID
`allTextDups`	List of character vectors: IDs of texts which occur twice or more often grouped by text equality
`textMetaDups`	List of character vectors: IDs of texts which occur twice or more often and have the same meta information grouped by text and meta equality

Examples

texts <- list(A="Give a Man a Fish, and You Feed Him for a Day.
Teach a Man To Fish, and You Feed Him for a Lifetime",
A="A fake duplicate",
B="So Long, and Thanks for All the Fish",
B="So Long, and Thanks for All the Fish",
C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.",
C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.")

corpus <- textmeta(meta=data.frame(id=c("A", "A", "B", "B", "C", "C"),
title=c("Fishing", "Fake duplicate", "Don't panic!", "towel day", "Sir Ronald", "Sir Ronald"),
date=c("1885-01-02", "1885-01-03", "1979-03-04", "1979-03-05", "1951-05-06", "1951-05-06"),
stringsAsFactors=FALSE), text=texts)

duplicates <- deleteAndRenameDuplicates(object=corpus)
duplist(object=duplicates, paragraph = FALSE)
texts <- list(A="Give a Man a Fish, and You Feed Him for a Day.
Teach a Man To Fish, and You Feed Him for a Lifetime",
A="A fake duplicate",
B="So Long, and Thanks for All the Fish",
B="So Long, and Thanks for All the Fish",
C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.",
C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.")

corpus <- textmeta(meta=data.frame(id=c("A", "A", "B", "B", "C", "C"),
title=c("Fishing", "Fake duplicate", "Don't panic!", "towel day", "Sir Ronald", "Sir Ronald"),
date=c("1885-01-02", "1885-01-03", "1979-03-04", "1979-03-05", "1951-05-06", "1951-05-06"),
stringsAsFactors=FALSE), text=texts)

duplicates <- deleteAndRenameDuplicates(object=corpus)
duplist(object=duplicates, paragraph = FALSE)

Subcorpus With Count Filter

Description

Generates a subcorpus by restricting it to texts containing a specific number of words.

Usage

filterCount(...)

## Default S3 method:
filterCount(text, count = 1L, out = c("text", "bin", "count"), ...)

## S3 method for class 'textmeta'
filterCount(
  object,
  count = 1L,
  out = c("text", "bin", "count"),
  filtermeta = TRUE,
  ...
)
filterCount(...)

## Default S3 method:
filterCount(text, count = 1L, out = c("text", "bin", "count"), ...)

## S3 method for class 'textmeta'
filterCount(
  object,
  count = 1L,
  out = c("text", "bin", "count"),
  filtermeta = TRUE,
  ...
)

Arguments

`...`	Not used.
`text`	Not necassary if `object` is specified, else should be `object$text`: list of article texts
`count`	An integer marking how many words must at least be found in the text.
`out`	Type of output: `text` filtered corpus, `bin` logical vector for all texts, `count` the counts.
`object`	A `textmeta` object
`filtermeta`	Logical: Should the meta component be filtered, too?

Value

textmeta object if object is specified, else only the filtered text. If a textmeta object is returned its meta data are filtered to those texts which appear in the corpus by default (filtermeta).

Examples

texts <- list(A="Give a Man a Fish, and You Feed Him for a Day.
Teach a Man To Fish, and You Feed Him for a Lifetime",
B="So Long, and Thanks for All the Fish",
C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.")

filterCount(text=texts, count=10L)
filterCount(text=texts, count=10L, out="bin")
filterCount(text=texts, count=10L, out="count")
texts <- list(A="Give a Man a Fish, and You Feed Him for a Day.
Teach a Man To Fish, and You Feed Him for a Lifetime",
B="So Long, and Thanks for All the Fish",
C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.")

filterCount(text=texts, count=10L)
filterCount(text=texts, count=10L, out="bin")
filterCount(text=texts, count=10L, out="count")

Subcorpus With Date Filter

Description

Generates a subcorpus by restricting it to a specific time window.

Usage

filterDate(...)

## Default S3 method:
filterDate(
  text,
  meta,
  s.date = min(meta$date, na.rm = TRUE),
  e.date = max(meta$date, na.rm = TRUE),
  ...
)

## S3 method for class 'textmeta'
filterDate(
  object,
  s.date = min(object$meta$date, na.rm = TRUE),
  e.date = max(object$meta$date, na.rm = TRUE),
  filtermeta = TRUE,
  ...
)
filterDate(...)

## Default S3 method:
filterDate(
  text,
  meta,
  s.date = min(meta$date, na.rm = TRUE),
  e.date = max(meta$date, na.rm = TRUE),
  ...
)

## S3 method for class 'textmeta'
filterDate(
  object,
  s.date = min(object$meta$date, na.rm = TRUE),
  e.date = max(object$meta$date, na.rm = TRUE),
  filtermeta = TRUE,
  ...
)

Arguments

`...`	Not used.
`text`	Not necessary if `object` is specified, else should be `object$text`
`meta`	Not necessary if `object` is specified, else should be `object$meta`
`s.date`	Start date of subcorpus as date object
`e.date`	End date of subcorpus as date object
`object`	`textmeta` object
`filtermeta`	Logical: Should the meta component be filtered, too?

Value

Examples

texts <- list(A="Give a Man a Fish, and You Feed Him for a Day.
Teach a Man To Fish, and You Feed Him for a Lifetime",
B="So Long, and Thanks for All the Fish",
C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.")

corpus <- textmeta(meta=data.frame(id=c("A", "B", "C", "D"),
title=c("Fishing", "Don't panic!", "Sir Ronald", "Berlin"),
date=c("1885-01-02", "1979-03-04", "1951-05-06", "1967-06-02"),
additionalVariable=1:4, stringsAsFactors=FALSE), text=texts)

subcorpus <- filterDate(object=corpus, s.date = "1951-05-06")
subcorpus$meta
subcorpus$text
texts <- list(A="Give a Man a Fish, and You Feed Him for a Day.
Teach a Man To Fish, and You Feed Him for a Lifetime",
B="So Long, and Thanks for All the Fish",
C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.")

corpus <- textmeta(meta=data.frame(id=c("A", "B", "C", "D"),
title=c("Fishing", "Don't panic!", "Sir Ronald", "Berlin"),
date=c("1885-01-02", "1979-03-04", "1951-05-06", "1967-06-02"),
additionalVariable=1:4, stringsAsFactors=FALSE), text=texts)

subcorpus <- filterDate(object=corpus, s.date = "1951-05-06")
subcorpus$meta
subcorpus$text

Subcorpus With ID Filter

Description

Generates a subcorpus by restricting it to specific ids.

Usage

filterID(...)

## Default S3 method:
filterID(text, id, ...)

## S3 method for class 'textmeta'
filterID(object, id, filtermeta = TRUE, ...)
filterID(...)

## Default S3 method:
filterID(text, id, ...)

## S3 method for class 'textmeta'
filterID(object, id, filtermeta = TRUE, ...)

Arguments

`...`	Not used.
`text`	Not necassary if `object` is specified, else should be `object$text`: list of article texts
`id`	Character: IDs the corpus should be filtered to.
`object`	A `textmeta` object
`filtermeta`	Logical: Should the meta component be filtered, too?

Value

Examples

texts <- list(A="Give a Man a Fish, and You Feed Him for a Day.
Teach a Man To Fish, and You Feed Him for a Lifetime",
B="So Long, and Thanks for All the Fish",
C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.")

meta <- data.frame(id = c("C", "B"), date = NA, title = c("Fisher", "Fish"),
stringsAsFactors = FALSE)
tm <- textmeta(text = texts, meta = meta)

filterID(texts, c("A", "B"))
filterID(texts, "C")
filterID(tm, "C")
filterID(tm, "B")
filterID(tm, c("B", "A"), FALSE)
texts <- list(A="Give a Man a Fish, and You Feed Him for a Day.
Teach a Man To Fish, and You Feed Him for a Lifetime",
B="So Long, and Thanks for All the Fish",
C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.")

meta <- data.frame(id = c("C", "B"), date = NA, title = c("Fisher", "Fish"),
stringsAsFactors = FALSE)
tm <- textmeta(text = texts, meta = meta)

filterID(texts, c("A", "B"))
filterID(texts, "C")
filterID(tm, "C")
filterID(tm, "B")
filterID(tm, c("B", "A"), FALSE)

Subcorpus With Word Filter

Description

Generates a subcorpus by restricting it to texts containing specific filter words.

Usage

filterWord(...)

## Default S3 method:
filterWord(
  text,
  search,
  ignore.case = FALSE,
  out = c("text", "bin", "count"),
  ...
)

## S3 method for class 'textmeta'
filterWord(
  object,
  search,
  ignore.case = FALSE,
  out = c("text", "bin", "count"),
  filtermeta = TRUE,
  ...
)
filterWord(...)

## Default S3 method:
filterWord(
  text,
  search,
  ignore.case = FALSE,
  out = c("text", "bin", "count"),
  ...
)

## S3 method for class 'textmeta'
filterWord(
  object,
  search,
  ignore.case = FALSE,
  out = c("text", "bin", "count"),
  filtermeta = TRUE,
  ...
)

Arguments

`...`	Not used.
`text`	Not necessary if `object` is specified, else should be `object$text`: list of article texts.
`search`	List of data frames. Every List element is an 'or' link, every entry in a data frame is linked by an 'and'. The dataframe must have following tree variables: `pattern` a character string including the search terms, `word`, a logical value displaying if a word (TRUE) or character (search) is wanted and `count` an integer marking how many times the word must at least be found in the text. `word` can alternatively be a character string containing the keywords `pattern` for character search, `word` for word-search and `left` and `right` for truncated search. If `search` is only a character Vector the link is 'or', and a character search will be used with `count=1`
`ignore.case`	Logical: Lower and upper case will be ignored.
`out`	Type of output: `text` filtered corpus, `bin` logical vector for all texts, `count` the number of matches.
`object`	A `textmeta` object
`filtermeta`	Logical: Should the meta component be filtered, too?

Value

Examples

texts <- list(A="Give a Man a Fish, and You Feed Him for a Day.
Teach a Man To Fish, and You Feed Him for a Lifetime",
B="So Long, and Thanks for All the Fish",
C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.")

# search for pattern "fish"
filterWord(text=texts, search="fish", ignore.case=TRUE)

# search for word "fish"
filterWord(text=texts, search=data.frame(pattern="fish", word="word", count=1),
ignore.case=TRUE)

# pattern must appear at least two times
filterWord(text=texts, search=data.frame(pattern="fish", word="pattern", count=2),
ignore.case=TRUE)

# search for "fish" AND "day"
filterWord(text=texts, search=data.frame(pattern=c("fish", "day"), word="word", count=1),
ignore.case=TRUE)

# search for "Thanks" OR "integrals"
filterWord(text=texts, search=list(data.frame(pattern="Thanks", word="word", count=1),
data.frame(pattern="integrals", word="word", count=1)))

texts <- list(A="Give a Man a Fish, and You Feed Him for a Day.
Teach a Man To Fish, and You Feed Him for a Lifetime",
B="So Long, and Thanks for All the Fish",
C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.")

# search for pattern "fish"
filterWord(text=texts, search="fish", ignore.case=TRUE)

# search for word "fish"
filterWord(text=texts, search=data.frame(pattern="fish", word="word", count=1),
ignore.case=TRUE)

# pattern must appear at least two times
filterWord(text=texts, search=data.frame(pattern="fish", word="pattern", count=2),
ignore.case=TRUE)

# search for "fish" AND "day"
filterWord(text=texts, search=data.frame(pattern=c("fish", "day"), word="word", count=1),
ignore.case=TRUE)

# search for "Thanks" OR "integrals"
filterWord(text=texts, search=list(data.frame(pattern="Thanks", word="word", count=1),
data.frame(pattern="integrals", word="word", count=1)))

Function to validate the fit of the LDA model

Description

This function validates a LDA result by presenting a mix of topics and intruder topics to a human user, who has to identity them.

Usage

intruderTopics(
  text = NULL,
  beta = NULL,
  theta = NULL,
  id = NULL,
  numIntruder = 1,
  numOuttopics = 4,
  byScore = TRUE,
  minWords = 0L,
  minOuttopics = 0L,
  stopTopics = NULL,
  printSolution = FALSE,
  oldResult = NULL,
  test = FALSE,
  testinput = NULL
)
intruderTopics(
  text = NULL,
  beta = NULL,
  theta = NULL,
  id = NULL,
  numIntruder = 1,
  numOuttopics = 4,
  byScore = TRUE,
  minWords = 0L,
  minOuttopics = 0L,
  stopTopics = NULL,
  printSolution = FALSE,
  oldResult = NULL,
  test = FALSE,
  testinput = NULL
)

Arguments

`text`	A list of texts (e.g. the text element of a `textmeta` object).
`beta`	A matrix of word-probabilities or frequency table for the topics (e.g. the `topics` matrix from the `LDAgen` result). Each row is a topic, each column a word. The rows will be divided by the row sums, if they are not 1.
`theta`	A matrix of wordcounts per text and topic (e.g. the `document_sums` matrix from the `LDAgen` result). Each row is a topic, each column a text. In each cell stands the number of words in text j belonging to topic i.
`id`	Optional: character vector of text IDs that should be used for the function. Useful to start a inchoate coding task.
`numIntruder`	Intended number of intruder words. If `numIntruder` is a integer vector, the number would be sampled for each topic.
`numOuttopics`	tba Integer: Number of words per topic, including the intruder words
`byScore`	Logical: Should the score of `top.topic.words` from the `lda` package be used?
`minWords`	Integer: Minimum number of words for a choosen text.
`minOuttopics`	Integer: Minimal number of words a topic needs to be classified as a possible correct Topic.
`stopTopics`	Optional: Integer vector to deselect stopword topics for the coding task.
`printSolution`	Logical: If `TRUE` the coder gets a feedback after his/her vote.
`oldResult`	Result object from an unfinished run of `intruderWords`. If oldResult is used, all other parameter will be ignored.
`test`	Logical: Enables test mode
`testinput`	Input for function tests

Value

Object of class IntruderTopics. List of 11

`result`	Matrix of 3 columns. Each row represents one labeled text. `numIntruder` (1. column) gives the number of intruder topics inputated in this text, `missIntruder` (2. column) the number of the intruder topics which were not found by the coder and `falseIntruder` (3. column) the number of the topics choosen by the coder which were no intruder.
`beta`	Parameter of the function call
`theta`	Parameter of the function call
`id`	Charater Vector of IDs at the beginning
`byScore`	Parameter of the function call
`numIntruder`	Parameter of the function call
`numOuttopics`	Parameter of the function call
`minWords`	Parameter of the function call
`minOuttopics`	Parameter of the function call
`unusedID`	Character vector of unused text IDs for the next run
`stopTopics`	Parameter of the function call

References

Chang, Jonathan and Sean Gerrish and Wang, Chong and Jordan L. Boyd-graber and David M. Blei. Reading Tea Leaves: How Humans Interpret Topic Models. Advances in Neural Information Processing Systems, 2009.

Examples

## Not run: 
data(politics)
poliClean <- cleanTexts(politics)
words10 <- makeWordlist(text=poliClean$text)
words10 <- words10$words[words10$wordtable > 10]
poliLDA <- LDAprep(text=poliClean$text, vocab=words10)
LDAresult <- LDAgen(documents=poliLDA, K=10, vocab=words10)
intruder <- intruderTopics(text=politics$text, beta=LDAresult$topics,
                           theta=LDAresult$document_sums, id=names(poliLDA))

## End(Not run)
## Not run: 
data(politics)
poliClean <- cleanTexts(politics)
words10 <- makeWordlist(text=poliClean$text)
words10 <- words10$words[words10$wordtable > 10]
poliLDA <- LDAprep(text=poliClean$text, vocab=words10)
LDAresult <- LDAgen(documents=poliLDA, K=10, vocab=words10)
intruder <- intruderTopics(text=politics$text, beta=LDAresult$topics,
                           theta=LDAresult$document_sums, id=names(poliLDA))

## End(Not run)

Function to validate the fit of the LDA model

Description

This function validates a LDA result by presenting a mix of words from a topic and intruder words to a human user, who has to identity them.

Usage

intruderWords(
  beta = NULL,
  byScore = TRUE,
  numTopwords = 30L,
  numIntruder = 1L,
  numOutwords = 5L,
  noTopic = TRUE,
  printSolution = FALSE,
  oldResult = NULL,
  test = FALSE,
  testinput = NULL
)
intruderWords(
  beta = NULL,
  byScore = TRUE,
  numTopwords = 30L,
  numIntruder = 1L,
  numOutwords = 5L,
  noTopic = TRUE,
  printSolution = FALSE,
  oldResult = NULL,
  test = FALSE,
  testinput = NULL
)

Arguments

`beta`	A matrix of word-probabilities or frequency table for the topics (e.g. the `topics` matrix from the `LDAgen` result). Each row is a topic, each column a word. The rows will be divided by the row sums, if they are not 1.
`byScore`	Logical: Should the score of `top.topic.words` from the `lda` package be used?
`numTopwords`	The number of topwords to be used for the intruder words
`numIntruder`	Intended number of intruder words. If `numIntruder` is a integer vector, the number would be sampled for each topic.
`numOutwords`	Integer: Number of words per topic, including the intruder words.
`noTopic`	Logical: Is `x` input allowed to mark nonsense topics?
`printSolution`	tba
`oldResult`	Result object from an unfinished run of `intruderWords`. If oldResult is used, all other parameter will be ignored.
`test`	Logical: Enables test mode
`testinput`	Input for function tests

Value

Object of class IntruderWords. List of 7

`result`	Matrix of 3 columns. Each row represents one topic. All values are 0 if the topic did not run before. `numIntruder` (1. column) gives the number of intruder words inputated in this topic, `missIntruder` (2. column) the number of the intruder words which were not found by the coder and `falseIntruder` (3. column) the number of the words choosen by the coder which were no intruder.
`beta`	Parameter of the function call
`byScore`	Parameter of the function call
`numTopwords`	Parameter of the function call
`numIntruder`	Parameter of the function call
`numOutwords`	Parameter of the function call
`noTopic`	Parameter of the function call

References

Examples

## Not run: 
data(politics)
poliClean <- cleanTexts(politics)
words10 <- makeWordlist(text=poliClean$text)
words10 <- words10$words[words10$wordtable > 10]
poliLDA <- LDAprep(text=poliClean$text, vocab=words10)
LDAresult <- LDAgen(documents=poliLDA, K=10, vocab=words10)
intruder <- intruderWords(beta=LDAresult$topics)
## End(Not run)
## Not run: 
data(politics)
poliClean <- cleanTexts(politics)
words10 <- makeWordlist(text=poliClean$text)
words10 <- words10$words[words10$wordtable > 10]
poliLDA <- LDAprep(text=poliClean$text, vocab=words10)
LDAresult <- LDAgen(documents=poliLDA, K=10, vocab=words10)
intruder <- intruderWords(beta=LDAresult$topics)
## End(Not run)

Function to fit LDA model

Description

This function uses the lda.collapsed.gibbs.sampler from the lda- package and additionally saves topword lists and a R workspace.

Usage

LDAgen(
  documents,
  K = 100L,
  vocab,
  num.iterations = 200L,
  burnin = 70L,
  alpha = NULL,
  eta = NULL,
  seed = NULL,
  folder = file.path(tempdir(), "lda-result"),
  num.words = 50L,
  LDA = TRUE,
  count = FALSE
)
LDAgen(
  documents,
  K = 100L,
  vocab,
  num.iterations = 200L,
  burnin = 70L,
  alpha = NULL,
  eta = NULL,
  seed = NULL,
  folder = file.path(tempdir(), "lda-result"),
  num.words = 50L,
  LDA = TRUE,
  count = FALSE
)

Arguments

`documents`	A list prepared by `LDAprep`.
`K`	Number of topics
`vocab`	Character vector containing the words in the corpus
`num.iterations`	Number of iterations for the gibbs sampler
`burnin`	Number of iterations for the burnin
`alpha`	Hyperparameter for the topic proportions
`eta`	Hyperparameter for the word distributions
`seed`	A seed for reproducability.
`folder`	File for the results. Saves in the temporary directionary by default.
`num.words`	Number of words in the top topic words list
`LDA`	logical: Should a new model be fitted or an existing R workspace?
`count`	logical: Should article counts calculated per top topic words be used for output as csv (default: `FALSE`)?

Value

A .csv file containing the topword list and a R workspace containing the result data.

References

Blei, David M. and Ng, Andrew and Jordan, Michael. Latent Dirichlet allocation. Journal of Machine Learning Research, 2003.

Jonathan Chang (2012). lda: Collapsed Gibbs sampling methods for topic models.. R package version 1.3.2. http://CRAN.R-project.org/package=lda

Examples

texts <- list(A="Give a Man a Fish, and You Feed Him for a Day.
Teach a Man To Fish, and You Feed Him for a Lifetime",
B="So Long, and Thanks for All the Fish",
C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.")

corpus <- textmeta(meta=data.frame(id=c("A", "B", "C", "D"),
title=c("Fishing", "Don't panic!", "Sir Ronald", "Berlin"),
date=c("1885-01-02", "1979-03-04", "1951-05-06", "1967-06-02"),
additionalVariable=1:4, stringsAsFactors=FALSE), text=texts)

corpus <- cleanTexts(corpus)
wordlist <- makeWordlist(corpus$text)
ldaPrep <- LDAprep(text=corpus$text, vocab=wordlist$words)

LDAgen(documents=ldaPrep, K = 3L, vocab=wordlist$words, num.words=3)

texts <- list(A="Give a Man a Fish, and You Feed Him for a Day.
Teach a Man To Fish, and You Feed Him for a Lifetime",
B="So Long, and Thanks for All the Fish",
C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.")

corpus <- textmeta(meta=data.frame(id=c("A", "B", "C", "D"),
title=c("Fishing", "Don't panic!", "Sir Ronald", "Berlin"),
date=c("1885-01-02", "1979-03-04", "1951-05-06", "1967-06-02"),
additionalVariable=1:4, stringsAsFactors=FALSE), text=texts)

corpus <- cleanTexts(corpus)
wordlist <- makeWordlist(corpus$text)
ldaPrep <- LDAprep(text=corpus$text, vocab=wordlist$words)

LDAgen(documents=ldaPrep, K = 3L, vocab=wordlist$words, num.words=3)

Create Lda-ready Dataset

Description

This function transforms a text corpus such as the result of cleanTexts into the form needed by the lda-package.

Usage

LDAprep(text, vocab, reduce = TRUE)
LDAprep(text, vocab, reduce = TRUE)

Arguments

`text`	A list of tokenized texts
`vocab`	A character vector containing all words which should beused for lda
`reduce`	Logical: Should empty texts be deleted?

Value

A list in which every entry contains a matrix with two rows: The first row gives the number of the entry of the word in vocab minus one, the second row is 1 and the number of the occurrence of the word will be shown by the number of columns belonging to this word.

Examples

texts <- list(A="Give a Man a Fish, and You Feed Him for a Day.
Teach a Man To Fish, and You Feed Him for a Lifetime",
B="So Long, and Thanks for All the Fish",
C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.")

corpus <- textmeta(meta=data.frame(id=c("A", "B", "C", "D"),
title=c("Fishing", "Don't panic!", "Sir Ronald", "Berlin"),
date=c("1885-01-02", "1979-03-04", "1951-05-06", "1967-06-02"),
additionalVariable=1:4, stringsAsFactors=FALSE), text=texts)

corpus <- cleanTexts(corpus)
wordlist <- makeWordlist(corpus$text)
LDAprep(text=corpus$text, vocab=wordlist$words, reduce = TRUE)

texts <- list(A="Give a Man a Fish, and You Feed Him for a Day.
Teach a Man To Fish, and You Feed Him for a Lifetime",
B="So Long, and Thanks for All the Fish",
C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.")

corpus <- textmeta(meta=data.frame(id=c("A", "B", "C", "D"),
title=c("Fishing", "Don't panic!", "Sir Ronald", "Berlin"),
date=c("1885-01-02", "1979-03-04", "1951-05-06", "1967-06-02"),
additionalVariable=1:4, stringsAsFactors=FALSE), text=texts)

corpus <- cleanTexts(corpus)
wordlist <- makeWordlist(corpus$text)
LDAprep(text=corpus$text, vocab=wordlist$words, reduce = TRUE)

Counts Words in Text Corpora

Description

Creates a wordlist and a frequency table.

Usage

makeWordlist(text, k = 100000L, ...)
makeWordlist(text, k = 100000L, ...)

Arguments

`text`	List of texts.
`k`	Integer: How many texts should be processed at once (RAM usage)?
`...`	further arguments for the sort function. Often you want to set `method = "radix"`.

Details

This function helps, if table(x) needs too much RAM.

Value

`words`	An alphabetical list of the words in the corpus
`wordtable`	A frequency table of the words in the corpus

Examples

texts <- list(A="Give a Man a Fish, and You Feed Him for a Day.
Teach a Man To Fish, and You Feed Him for a Lifetime",
B="So Long, and Thanks for All the Fish",
C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.")

texts <- cleanTexts(text=texts)
makeWordlist(text=texts, k = 2L)

texts <- list(A="Give a Man a Fish, and You Feed Him for a Day.
Teach a Man To Fish, and You Feed Him for a Lifetime",
B="So Long, and Thanks for All the Fish",
C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.")

texts <- cleanTexts(text=texts)
makeWordlist(text=texts, k = 2L)

Preparation of Different LDAs For Clustering

Description

Merges different lda-results to one matrix, including only the words which appears in all lda-results.

Usage

mergeLDA(x)
mergeLDA(x)

Arguments

`x`	A list of lda results.

Details

The function is useful for merging lda-results prior to a cluster analysis with clusterTopics.

Value

A matrix including all topics from all lda-results. The number of rows is the number of topics, the number of columns is the number of words which appear in all results.

Examples

texts <- list(A="Give a Man a Fish, and You Feed Him for a Day.
Teach a Man To Fish, and You Feed Him for a Lifetime",
B="So Long, and Thanks for All the Fish",
C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.")

corpus <- textmeta(meta=data.frame(id=c("A", "B", "C", "D"),
title=c("Fishing", "Don't panic!", "Sir Ronald", "Berlin"),
date=c("1885-01-02", "1979-03-04", "1951-05-06", "1967-06-02"),
additionalVariable=1:4, stringsAsFactors=FALSE), text=texts)

corpus <- cleanTexts(corpus)
wordlist <- makeWordlist(corpus$text)
ldaPrep <- LDAprep(text=corpus$text, vocab=wordlist$words)

LDA1 <- LDAgen(documents=ldaPrep, K = 3L, vocab=wordlist$words, num.words=3)
LDA2 <- LDAgen(documents=ldaPrep, K = 3L, vocab=wordlist$words, num.words=3)
mergeLDA(list(LDA1=LDA1, LDA2=LDA2))
texts <- list(A="Give a Man a Fish, and You Feed Him for a Day.
Teach a Man To Fish, and You Feed Him for a Lifetime",
B="So Long, and Thanks for All the Fish",
C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.")

corpus <- textmeta(meta=data.frame(id=c("A", "B", "C", "D"),
title=c("Fishing", "Don't panic!", "Sir Ronald", "Berlin"),
date=c("1885-01-02", "1979-03-04", "1951-05-06", "1967-06-02"),
additionalVariable=1:4, stringsAsFactors=FALSE), text=texts)

corpus <- cleanTexts(corpus)
wordlist <- makeWordlist(corpus$text)
ldaPrep <- LDAprep(text=corpus$text, vocab=wordlist$words)

LDA1 <- LDAgen(documents=ldaPrep, K = 3L, vocab=wordlist$words, num.words=3)
LDA2 <- LDAgen(documents=ldaPrep, K = 3L, vocab=wordlist$words, num.words=3)
mergeLDA(list(LDA1=LDA1, LDA2=LDA2))

Merge Textmeta Objects

Description

Merges a list of textmeta objects to a single object. It is possible to control whether all columns or the intersect should be considered.

Usage

mergeTextmeta(x, all = TRUE)
mergeTextmeta(x, all = TRUE)

Arguments

`x`	A list of `textmeta` objects
`all`	Logical: Should the result contain `union` (`TRUE`) or `intersection` (`FALSE`) of columns of all objects? If `TRUE`, the columns which at least appear in one of the meta components are filled with `NA`s in the merged meta component.

Value

textmeta object

Examples

texts <- list(A="Give a Man a Fish, and You Feed Him for a Day.
Teach a Man To Fish, and You Feed Him for a Lifetime",
B="So Long, and Thanks for All the Fish",
C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.")

corpus <- textmeta(meta=data.frame(id=c("A", "B", "C", "D"),
title=c("Fishing", "Don't panic!", "Sir Ronald", "Berlin"),
date=c("1885-01-02", "1979-03-04", "1951-05-06", "1967-06-02"),
additionalVariable=1:4, stringsAsFactors=FALSE), text=texts)

corpus2 <- textmeta(meta=data.frame(id=c("E", "F"),
title=c("title1", "title2"),
date=c("2018-01-01", "2018-01-01"),
additionalVariable2=1:2, stringsAsFactors=FALSE), text=list(E="text1", F="text2"))

merged <- mergeTextmeta(x=list(corpus, corpus2), all = TRUE)
str(merged$meta)

merged <- mergeTextmeta(x=list(corpus, corpus2), all = FALSE)
str(merged$meta)
texts <- list(A="Give a Man a Fish, and You Feed Him for a Day.
Teach a Man To Fish, and You Feed Him for a Lifetime",
B="So Long, and Thanks for All the Fish",
C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.")

corpus <- textmeta(meta=data.frame(id=c("A", "B", "C", "D"),
title=c("Fishing", "Don't panic!", "Sir Ronald", "Berlin"),
date=c("1885-01-02", "1979-03-04", "1951-05-06", "1967-06-02"),
additionalVariable=1:4, stringsAsFactors=FALSE), text=texts)

corpus2 <- textmeta(meta=data.frame(id=c("E", "F"),
title=c("title1", "title2"),
date=c("2018-01-01", "2018-01-01"),
additionalVariable2=1:2, stringsAsFactors=FALSE), text=list(E="text1", F="text2"))

merged <- mergeTextmeta(x=list(corpus, corpus2), all = TRUE)
str(merged$meta)

merged <- mergeTextmeta(x=list(corpus, corpus2), all = FALSE)
str(merged$meta)

Plotting topics over time as stacked areas below plotted lines.

Description

Creates a stacked area plot of all or selected topics.

Usage

plotArea(
  ldaresult,
  ldaID,
  select = NULL,
  tnames = NULL,
  threshold = NULL,
  meta,
  unit = "quarter",
  xunit = "year",
  color = NULL,
  sort = TRUE,
  legend = NULL,
  legendLimit = 0,
  peak = 0,
  file
)
plotArea(
  ldaresult,
  ldaID,
  select = NULL,
  tnames = NULL,
  threshold = NULL,
  meta,
  unit = "quarter",
  xunit = "year",
  color = NULL,
  sort = TRUE,
  legend = NULL,
  legendLimit = 0,
  peak = 0,
  file
)

Arguments

`ldaresult`	LDA result object
`ldaID`	Character vector including IDs of the texts
`select`	Selects all topics if parameter is null. Otherwise vector of integers or topic label. Only topics belonging to that numbers, and labels respectively would be plotted.
`tnames`	Character vector of topic labels. It must have same length than number of topics in the model.
`threshold`	Numeric: Treshold between 0 and 1. Topics would only be used if at least one time unit exist with a topic proportion above the treshold
`meta`	The meta data for the texts or a date-string.
`unit`	Time unit for x-axis. Possible units are `"bimonth"`, `"quarter"`, `"season"`, `"halfyear"`, `"year"`, for more units see `round_date`
`xunit`	Time unit for tiks on the x-axis. For possible units see `round_date`
`color`	Color vector. Color vector would be replicated if the number of plotted topics is bigger than length of the vector.
`sort`	Logical: Should the topics be sorted by topic proportion?
`legend`	Position of legend. If `NULL` (default), no legend will be plotted
`legendLimit`	Numeric between 0 (default) and 1. Only Topics with proportions above this limit appear in the legend.
`peak`	Numeric between 0 (default) and 1. Label peaks above `peak`. For each Topic every area which are at least once above `peak` will e labeled. An area ends if the topic proportion is under 1 percent.
`file`	Character: File path if a pdf should be created

Details

This function is useful to visualize the volume of topics and to show trends over time.

Value

List of two matrices. rel contains the topic proportions over time, relcum contains the cumulated topic proportions

Examples

## Not run: 
data(politics)
poliClean <- cleanTexts(politics)
words10 <- makeWordlist(text=poliClean$text)
words10 <- words10$words[words10$wordtable > 10]
poliLDA <- LDAprep(text=poliClean$text, vocab=words10)
LDAresult <- LDAgen(documents=poliLDA, K=10, vocab=words10)
plotArea(ldaresult=LDAresult, ldaID=names(poliLDA), meta=politics$meta)

plotArea(ldaresult=LDAresult, ldaID=names(poliLDA), meta=politics$meta, select=c(1,3,5))

## End(Not run)
## Not run: 
data(politics)
poliClean <- cleanTexts(politics)
words10 <- makeWordlist(text=poliClean$text)
words10 <- words10$words[words10$wordtable > 10]
poliLDA <- LDAprep(text=poliClean$text, vocab=words10)
LDAresult <- LDAgen(documents=poliLDA, K=10, vocab=words10)
plotArea(ldaresult=LDAresult, ldaID=names(poliLDA), meta=politics$meta)

plotArea(ldaresult=LDAresult, ldaID=names(poliLDA), meta=politics$meta, select=c(1,3,5))

## End(Not run)

Plotting Counts of specified Wordgroups over Time (relative to Corpus)

Description

Creates a plot of the counts/proportion of given wordgroups (wordlist) in the subcorpus. The counts/proportion can be calculated on document or word level - with an 'and' or 'or' link - and additionally can be normalised by a subcorporus, which could be specified by id.

Usage

plotFreq(
  object,
  id = names(object$text),
  type = c("docs", "words"),
  wordlist,
  link = c("and", "or"),
  wnames,
  ignore.case = FALSE,
  rel = FALSE,
  mark = TRUE,
  unit = "month",
  curves = c("exact", "smooth", "both"),
  smooth = 0.05,
  both.lwd,
  both.lty,
  main,
  xlab,
  ylab,
  ylim,
  col,
  legend = "topright",
  natozero = TRUE,
  file,
  ...
)
plotFreq(
  object,
  id = names(object$text),
  type = c("docs", "words"),
  wordlist,
  link = c("and", "or"),
  wnames,
  ignore.case = FALSE,
  rel = FALSE,
  mark = TRUE,
  unit = "month",
  curves = c("exact", "smooth", "both"),
  smooth = 0.05,
  both.lwd,
  both.lty,
  main,
  xlab,
  ylab,
  ylim,
  col,
  legend = "topright",
  natozero = TRUE,
  file,
  ...
)

Arguments

`object`	`textmeta` object with strictly tokenized `text` component (`character` vectors) - like a result of `cleanTexts`
`id`	`character` vector (default: `object$meta$id`) which IDs specify the subcorpus
`type`	`character` (default: `"docs"`) should counts/proportion of documents, where every `"docs"` or words `"words"` be plotted
`wordlist`	list of `character` vectors. Every list element is an 'or' link, every `character` string in a vector is linked by the argument `link`. If `wordlist` is only a `character` vector it will be coerced to a list of the same length as the vector (see `as.list`), so that the argument `link` has no effect. Each `character` vector as a list element represents one curve in the outcoming plot
`link`	`character` (default: `"and"`) should the (inner) `character` vectors of each list element be linked by an `"and"` or an `"or"`
`wnames`	`character` vector of same length as `wordlist` - labels for every group of 'and' linked words
`ignore.case`	`logical` (default: `FALSE`) option from `grepl`.
`rel`	`logical` (default: `FALSE`) should counts (`FALSE`) or proportion (`TRUE`) be plotted
`mark`	`logical` (default: `TRUE`) should years be marked by vertical lines
`unit`	`character` (default: `"month"`) to which unit should dates be floored. Other possible units are `"bimonth"`, `"quarter"`, `"season"`, `"halfyear"`, `"year"`, for more units see `round_date`
`curves`	`character` (default: `"exact"`) should `"exact"`, `"smooth"` curve or `"both"` be plotted
`smooth`	`numeric` (default: `0.05`) smoothing parameter which is handed over to `lowess` as `f`
`both.lwd`	graphical parameter for smoothed values if `curves = "both"`
`both.lty`	graphical parameter for smoothed values if `curves = "both"`
`main`	`character` graphical parameter
`xlab`	`character` graphical parameter
`ylab`	`character` graphical parameter
`ylim`	(default if `rel = TRUE`: `c(0, 1)`) graphical parameter
`col`	graphical parameter, could be a vector. If `curves = "both"` the function will for every wordgroup plot at first the exact and then the smoothed curve - this is important for your col order.
`legend`	`character` (default: "topright") value(s) to specify the legend coordinates. If "none" no legend is plotted.
`natozero`	`logical` (default: `TRUE`) should NAs be coerced to zeros. Only has effect if `rel = TRUE`.
`file`	`character` file path if a pdf should be created
`...`	additional graphical parameters

Value

A plot. Invisible: A dataframe with columns date and wnames - and additionally columns wnames_rel for rel = TRUE - with the counts (and proportion) of the given wordgroups.

Examples

## Not run: 
data(politics)
poliClean <- cleanTexts(politics)
plotFreq(poliClean, wordlist=c("obama", "bush"))

## End(Not run)
## Not run: 
data(politics)
poliClean <- cleanTexts(politics)
plotFreq(poliClean, wordlist=c("obama", "bush"))

## End(Not run)

Plotting Topics over Time relative to Corpus

Description

Creates a pdf showing a heat map. For each topic, the heat map shows the deviation of its current share from its mean share. Shares can be calculated on corpus level or on subcorpus level concerning LDA vocabulary. Shares can be calculated in absolute deviation from the mean or relative to the mean of the topic to account for different topic strengths.

Usage

plotHeat(
  object,
  ldaresult,
  ldaID,
  select = 1:nrow(ldaresult$document_sums),
  tnames,
  norm = FALSE,
  file,
  unit = "year",
  date_breaks = 1,
  margins = c(5, 0),
  ...
)
plotHeat(
  object,
  ldaresult,
  ldaID,
  select = 1:nrow(ldaresult$document_sums),
  tnames,
  norm = FALSE,
  file,
  unit = "year",
  date_breaks = 1,
  margins = c(5, 0),
  ...
)

Arguments

`object`	`textmeta` object with strictly tokenized `text` component (calculation of proportion on document lengths) or `textmeta` object which contains only the `meta` component (calculation of proportion on count of words out of the LDA vocabulary in each document)
`ldaresult`	LDA result object.
`ldaID`	Character vector containing IDs of the texts.
`select`	Numeric vector containing the numbers of the topics to be plotted. Defaults to all topics.
`tnames`	Character vector with labels for the topics.
`norm`	Logical: Should the values be normalized by the mean topic share to account for differently sized topics (default: `FALSE`)?
`file`	Character vector containing the path and name for the pdf output file.
`unit`	Character: To which unit should dates be floored (default: `"year"`)? Other possible units are `"bimonth"`, `"quarter"`, `"season"`, `"halfyear"`, `"year"`, for more units see `round_date`
`date_breaks`	How many labels should be shown on the x axis (default: `1`)? If `data_breaks` is `5` every fifth label is drawn.
`margins`	See `heatmap`
`...`	Additional graphical parameters passed to `heatmap`, for example `distfun` or `hclustfun`. details The function is useful to search for peaks in the coverage of topics.

Value

A pdf. Invisible: A dataframe.

Examples

## Not run: 
data(politics)
poliClean <- cleanTexts(politics)
words10 <- makeWordlist(text=poliClean$text)
words10 <- words10$words[words10$wordtable > 10]
poliLDA <- LDAprep(text=poliClean$text, vocab=words10)
LDAresult <- LDAgen(documents=poliLDA, K=10, vocab=words10)
plotHeat(object=poliClean, ldaresult=LDAresult, ldaID=names(poliLDA))

## End(Not run)
## Not run: 
data(politics)
poliClean <- cleanTexts(politics)
words10 <- makeWordlist(text=poliClean$text)
words10 <- words10$words[words10$wordtable > 10]
poliLDA <- LDAprep(text=poliClean$text, vocab=words10)
LDAresult <- LDAgen(documents=poliLDA, K=10, vocab=words10)
plotHeat(object=poliClean, ldaresult=LDAresult, ldaID=names(poliLDA))

## End(Not run)

Plots Counts of Documents or Words over Time (relative to Corpus)

Description

Creates a plot of the counts/proportion of documents/words in the subcorpus, which could be specified by id.

Usage

plotScot(
  object,
  id = object$meta$id,
  type = c("docs", "words"),
  rel = FALSE,
  mark = TRUE,
  unit = "month",
  curves = c("exact", "smooth", "both"),
  smooth = 0.05,
  main,
  xlab,
  ylab,
  ylim,
  both.lwd,
  both.col,
  both.lty,
  natozero = TRUE,
  file,
  ...
)
plotScot(
  object,
  id = object$meta$id,
  type = c("docs", "words"),
  rel = FALSE,
  mark = TRUE,
  unit = "month",
  curves = c("exact", "smooth", "both"),
  smooth = 0.05,
  main,
  xlab,
  ylab,
  ylim,
  both.lwd,
  both.col,
  both.lty,
  natozero = TRUE,
  file,
  ...
)

Arguments

`object`	`textmeta` object with strictly tokenized `text` component vectors if `type = "words"`
`id`	Character: Vector (default: `object$meta$id`) which IDs specify the subcorpus
`type`	Character: Should counts/proportion of documents `"docs"` (default) or words `"words"` be plotted?
`rel`	Logical: Should counts (default: `FALSE`) or proportion (`TRUE`) be plotted?
`mark`	Logical: Should years be marked by vertical lines (default: `TRUE`)?
`unit`	Character: To which unit should dates be floored (default: `"month"`). Other possible units are `"bimonth"`, `"quarter"`, `"season"`, `"halfyear"`, `"year"`, for more units see `round_date`.
`curves`	Character: Should `"exact"`, `"smooth"` curve or `"both"` be plotted (default: `"exact"`)?
`smooth`	Numeric: Smoothing parameter which is handed over to `lowess` as `f` (default: `0.05`).
`main`	Character: Graphical parameter
`xlab`	Character: Graphical parameter
`ylab`	Character: Graphical parameter
`ylim`	Graphical parameter (default if `rel = TRUE`: `c(0, 1)`)
`both.lwd`	Graphical parameter for smoothed values if `curves = "both"`
`both.col`	Graphical parameter for smoothed values if `curves = "both"`
`both.lty`	Graphical parameter for smoothed values if `curves = "both"`
`natozero`	Logical: Should NAs be coerced to zeros (default: `TRUE`)? Only has an effect if `rel = TRUE`.
`file`	Character: File path if a pdf should be created.
`...`	additional graphical parameters

Details

object needs a textmeta object with strictly tokenized text component (character vectors) if you use type = "words". If you use type = "docs" you can use a tokenized or a non-tokenized text component. In fact, you can use the textmeta constructor (textmeta(meta = <your-meta-data.frame>)) to create a textmeta object containing only the meta field and plot the resulting object. This way you can save time and memory at the first glance.

Value

A plot Invisible: A dataframe with columns date and counts, respectively proportion

Examples

## Not run: 
data(politics)
poliClean <- cleanTexts(politics)

# complete corpus
plotScot(object=poliClean)

# subcorpus
subID <- filterWord(poliClean, search=c("bush", "obama"), out="bin")
plotScot(object=poliClean, id=names(subID)[subID], curves="both", smooth=0.3)

## End(Not run)
## Not run: 
data(politics)
poliClean <- cleanTexts(politics)

# complete corpus
plotScot(object=poliClean)

# subcorpus
subID <- filterWord(poliClean, search=c("bush", "obama"), out="bin")
plotScot(object=poliClean, id=names(subID)[subID], curves="both", smooth=0.3)

## End(Not run)

Plotting Counts of Topics over Time (Relative to Corpus)

Description

Creates a plot of the counts/proportion of specified topics of a result of LDAgen. There is an option to plot all curves in one plot or to create one plot for every curve (see pages). In addition the plots can be written to a pdf by setting file.

Usage

plotTopic(
  object,
  ldaresult,
  ldaID,
  select = 1:nrow(ldaresult$document_sums),
  tnames,
  rel = FALSE,
  mark = TRUE,
  unit = "month",
  curves = c("exact", "smooth", "both"),
  smooth = 0.05,
  main,
  xlab,
  ylim,
  ylab,
  both.lwd,
  both.lty,
  col,
  legend = ifelse(pages, "onlyLast:topright", "topright"),
  pages = FALSE,
  natozero = TRUE,
  file,
  ...
)
plotTopic(
  object,
  ldaresult,
  ldaID,
  select = 1:nrow(ldaresult$document_sums),
  tnames,
  rel = FALSE,
  mark = TRUE,
  unit = "month",
  curves = c("exact", "smooth", "both"),
  smooth = 0.05,
  main,
  xlab,
  ylim,
  ylab,
  both.lwd,
  both.lty,
  col,
  legend = ifelse(pages, "onlyLast:topright", "topright"),
  pages = FALSE,
  natozero = TRUE,
  file,
  ...
)

Arguments

`object`	`textmeta` object with strictly tokenized `text` component (character vectors) - such as a result of `cleanTexts`
`ldaresult`	The result of a function call `LDAgen`
`ldaID`	Character vector of IDs of the documents in `ldaresult`
`select`	Integer: Which topics of `ldaresult` should be plotted (default: all topics)?
`tnames`	Character vector of same length as `select` - labels for the topics (default are the first returned words of `top.topic.words` from the `lda` package for each topic)
`rel`	Logical: Should counts (`FALSE`) or proportion (`TRUE`) be plotted (default: `FALSE`)?
`mark`	Logical: Should years be marked by vertical lines (default: `TRUE`)?
`unit`	Character: To which unit should dates be floored (default: `"month"`)? Other possible units are `"bimonth"`, `"quarter"`, `"season"`, `"halfyear"`, `"year"`, for more units see `round_date`
`curves`	Character: Should `"exact"`, `"smooth"` curve or `"both"` be plotted (default: `"exact"`)?
`smooth`	Numeric: Smoothing parameter which is handed over to `lowess` as `f` (default: `0.05`)
`main`	Character: Graphical parameter
`xlab`	Character: Graphical parameter
`ylim`	Graphical parameter
`ylab`	Character: Graphical parameter
`both.lwd`	Graphical parameter for smoothed values if `curves = "both"`
`both.lty`	Graphical parameter for smoothed values if `curves = "both"`
`col`	Graphical parameter, could be a vector. If `curves = "both"` the function will for every topicgroup plot at first the exact and then the smoothed curve - this is important for your col order.
`legend`	Character: Value(s) to specify the legend coordinates (default: `"topright"`, `"onlyLast:topright"` for `pages = TRUE` respectively). If "none" no legend is plotted.
`pages`	Logical: Should all curves be plotted in a single plot (default: `FALSE`)? In addtion you could set `legend = "onlyLast:<argument>"` with `<argument>` as a `character` `legend` argument for only plotting a legend on the last plot of set.
`natozero`	Logical: Should NAs be coerced to zeros (default: `TRUE`)? Only has effect if `rel = TRUE`.
`file`	Character: File path if a pdf should be created
`...`	Additional graphical parameters

Value

A plot. Invisible: A dataframe with columns date and tnames with the counts/proportion of the selected topics.

Examples

## Not run: 
data(politics)
poliClean <- cleanTexts(politics)
words10 <- makeWordlist(text=poliClean$text)
words10 <- words10$words[words10$wordtable > 10]
poliLDA <- LDAprep(text=poliClean$text, vocab=words10)
LDAresult <- LDAgen(documents=poliLDA, K=10, vocab=words10)

# plot all topics
plotTopic(object=poliClean, ldaresult=LDAresult, ldaID=names(poliLDA))

# plot special topics
plotTopic(object=poliClean, ldaresult=LDAresult, ldaID=names(poliLDA), select=c(1,4))

## End(Not run)
## Not run: 
data(politics)
poliClean <- cleanTexts(politics)
words10 <- makeWordlist(text=poliClean$text)
words10 <- words10$words[words10$wordtable > 10]
poliLDA <- LDAprep(text=poliClean$text, vocab=words10)
LDAresult <- LDAgen(documents=poliLDA, K=10, vocab=words10)

# plot all topics
plotTopic(object=poliClean, ldaresult=LDAresult, ldaID=names(poliLDA))

# plot special topics
plotTopic(object=poliClean, ldaresult=LDAresult, ldaID=names(poliLDA), select=c(1,4))

## End(Not run)

Plotting Counts of Topics-Words-Combination over Time (Relative to Words)

Description

Creates a plot of the counts/proportion of specified combination of topics and words. It is important to keep in mind that the baseline for proportions are the sums of words, not sums of topics. See also plotWordpt. There is an option to plot all curves in one plot or to create one plot for every curve (see pages). In addition the plots can be written to a pdf by setting file.

Usage

plotTopicWord(
  object,
  docs,
  ldaresult,
  ldaID,
  wordlist = lda::top.topic.words(ldaresult$topics, 1),
  link = c("and", "or"),
  select = 1:nrow(ldaresult$document_sums),
  tnames,
  wnames,
  rel = FALSE,
  mark = TRUE,
  unit = "month",
  curves = c("exact", "smooth", "both"),
  smooth = 0.05,
  legend = ifelse(pages, "onlyLast:topright", "topright"),
  pages = FALSE,
  natozero = TRUE,
  file,
  main,
  xlab,
  ylab,
  ylim,
  both.lwd,
  both.lty,
  col,
  ...
)
plotTopicWord(
  object,
  docs,
  ldaresult,
  ldaID,
  wordlist = lda::top.topic.words(ldaresult$topics, 1),
  link = c("and", "or"),
  select = 1:nrow(ldaresult$document_sums),
  tnames,
  wnames,
  rel = FALSE,
  mark = TRUE,
  unit = "month",
  curves = c("exact", "smooth", "both"),
  smooth = 0.05,
  legend = ifelse(pages, "onlyLast:topright", "topright"),
  pages = FALSE,
  natozero = TRUE,
  file,
  main,
  xlab,
  ylab,
  ylim,
  both.lwd,
  both.lty,
  col,
  ...
)

Arguments

`object`	`textmeta` object with strictly tokenized `text` component (Character vectors) - such as a result of `cleanTexts`
`docs`	Object as a result of `LDAprep` which was handed over to `LDAgen`
`ldaresult`	The result of a function call `LDAgen` with `docs` as argument
`ldaID`	Character vector of IDs of the documents in `ldaresult`
`wordlist`	List of Ccharacter vectors. Every list element is an 'or' link, every character string in a vector is linked by the argument `link`. If `wordlist` is only a character vector it will be coerced to a list of the same length as the vector (see `as.list`), so that the argument `link` has no effect. Each character vector as a list element represents one curve in the emerging plot.
`link`	Character: Should the (inner) character vectors of each list element be linked by an `"and"` or an `"or"` (default: `"and"`)?
`select`	List of integer vectors: Which topics - linked by an "or" every time - should be take into account for plotting the word counts/proportion (default: all topics as simple integer vector)?
`tnames`	Character vector of same length as `select` - labels for the topics (default are the first returned words of
`wnames`	Character vector of same length as `wordlist` - labels for every group of 'and' linked words `top.topic.words` from the `lda` package for each topic)
`rel`	Logical: Should counts (`FALSE`) or proportion (`TRUE`) be plotted (default: `FALSE`)?
`mark`	Logical: Should years be marked by vertical lines (default: `TRUE`)?
`unit`	Character: To which unit should dates be floored (default: `"month"`)? Other possible units are `"bimonth"`, `"quarter"`, `"season"`, `"halfyear"`, `"year"`, for more units see `round_date`
`curves`	Character: Should `"exact"`, `"smooth"` curve or `"both"` be plotted (default: `"exact"`)?
`smooth`	Numeric: Smoothing parameter which is handed over to `lowess` as `f` (default: `0.05`)
`legend`	Character: Value(s) to specify the legend coordinates (default: `"topright"`, `"onlyLast:topright"` for `pages = TRUE` respectively). If "none" no legend is plotted.
`pages`	Logical: Should all curves be plotted in a single plot (default: `FALSE`)? In addition you could set `legend = "onlyLast:<argument>"` with `<argument>` as a character `legend` argument for only plotting a legend on the last plot of set.
`natozero`	Logical: Should NAs be coerced to zeros (default: `TRUE`)?
`file`	Character: File path if a pdf should be created
`main`	Character: Graphical parameter
`xlab`	Character: Graphical parameter
`ylab`	Character: Graphical parameter
`ylim`	Graphical parameter
`both.lwd`	Graphical parameter for smoothed values if `curves = "both"`
`both.lty`	Graphical parameter for smoothed values if `curves = "both"`
`col`	Graphical parameter, could be a vector. If `curves = "both"` the function will for every wordgroup plot at first the exact and then the smoothed curve - this is important for your col order.
`...`	Additional graphical parameters

Value

A plot. Invisible: A dataframe with columns date and tnames: wnames with the counts/proportion of the selected combination of topics and words.

Examples

## Not run: 
data(politics)
poliClean <- cleanTexts(politics)
words10 <- makeWordlist(text=poliClean$text)
words10 <- words10$words[words10$wordtable > 10]
poliLDA <- LDAprep(text=poliClean$text, vocab=words10)
LDAresult <- LDAgen(documents=poliLDA, K=10, vocab=words10)

# plot topwords from each topic
plotTopicWord(object=poliClean, docs=poliLDA, ldaresult=LDAresult, ldaID=names(poliLDA))
plotTopicWord(object=poliClean, docs=poliLDA, ldaresult=LDAresult, ldaID=names(poliLDA), rel=TRUE)

# plot one word in different topics
plotTopicWord(object=poliClean, docs=poliLDA, ldaresult=LDAresult, ldaID=names(poliLDA),
              select=c(1,3,8), wordlist=c("bush"))

# Differences between plotTopicWord and plotWordpt
par(mfrow=c(2,2))
plotTopicWord(object=poliClean, docs=poliLDA, ldaresult=LDAresult, ldaID=names(poliLDA),
              select=c(1,3,8), wordlist=c("bush"), rel=FALSE)
plotWordpt(object=poliClean, docs=poliLDA, ldaresult=LDAresult, ldaID=names(poliLDA),
           select=c(1,3,8), wordlist=c("bush"), rel=FALSE)
plotTopicWord(object=poliClean, docs=poliLDA, ldaresult=LDAresult, ldaID=names(poliLDA),
              select=c(1,3,8), wordlist=c("bush"), rel=TRUE)
plotWordpt(object=poliClean, docs=poliLDA, ldaresult=LDAresult, ldaID=names(poliLDA),
           select=c(1,3,8), wordlist=c("bush"), rel=TRUE)

## End(Not run)
## Not run: 
data(politics)
poliClean <- cleanTexts(politics)
words10 <- makeWordlist(text=poliClean$text)
words10 <- words10$words[words10$wordtable > 10]
poliLDA <- LDAprep(text=poliClean$text, vocab=words10)
LDAresult <- LDAgen(documents=poliLDA, K=10, vocab=words10)

# plot topwords from each topic
plotTopicWord(object=poliClean, docs=poliLDA, ldaresult=LDAresult, ldaID=names(poliLDA))
plotTopicWord(object=poliClean, docs=poliLDA, ldaresult=LDAresult, ldaID=names(poliLDA), rel=TRUE)

# plot one word in different topics
plotTopicWord(object=poliClean, docs=poliLDA, ldaresult=LDAresult, ldaID=names(poliLDA),
              select=c(1,3,8), wordlist=c("bush"))

# Differences between plotTopicWord and plotWordpt
par(mfrow=c(2,2))
plotTopicWord(object=poliClean, docs=poliLDA, ldaresult=LDAresult, ldaID=names(poliLDA),
              select=c(1,3,8), wordlist=c("bush"), rel=FALSE)
plotWordpt(object=poliClean, docs=poliLDA, ldaresult=LDAresult, ldaID=names(poliLDA),
           select=c(1,3,8), wordlist=c("bush"), rel=FALSE)
plotTopicWord(object=poliClean, docs=poliLDA, ldaresult=LDAresult, ldaID=names(poliLDA),
              select=c(1,3,8), wordlist=c("bush"), rel=TRUE)
plotWordpt(object=poliClean, docs=poliLDA, ldaresult=LDAresult, ldaID=names(poliLDA),
           select=c(1,3,8), wordlist=c("bush"), rel=TRUE)

## End(Not run)

Plots Counts of Topics-Words-Combination over Time (Relative to Topics)

Description

Creates a plot of the counts/proportion of specified combination of topics and words. The plot shows how often a word appears in a topic. It is important to keep in mind that the baseline for proportions are the sums of topics, not sums of words. See also plotTopicWord. There is an option to plot all curves in one plot or to create one plot for every curve (see pages). In addition the plots can be written to a pdf by setting file.

Usage

plotWordpt(
  object,
  docs,
  ldaresult,
  ldaID,
  select = 1:nrow(ldaresult$document_sums),
  link = c("and", "or"),
  wordlist = lda::top.topic.words(ldaresult$topics, 1),
  tnames,
  wnames,
  rel = FALSE,
  mark = TRUE,
  unit = "month",
  curves = c("exact", "smooth", "both"),
  smooth = 0.05,
  legend = ifelse(pages, "onlyLast:topright", "topright"),
  pages = FALSE,
  natozero = TRUE,
  file,
  main,
  xlab,
  ylab,
  ylim,
  both.lwd,
  both.lty,
  col,
  ...
)
plotWordpt(
  object,
  docs,
  ldaresult,
  ldaID,
  select = 1:nrow(ldaresult$document_sums),
  link = c("and", "or"),
  wordlist = lda::top.topic.words(ldaresult$topics, 1),
  tnames,
  wnames,
  rel = FALSE,
  mark = TRUE,
  unit = "month",
  curves = c("exact", "smooth", "both"),
  smooth = 0.05,
  legend = ifelse(pages, "onlyLast:topright", "topright"),
  pages = FALSE,
  natozero = TRUE,
  file,
  main,
  xlab,
  ylab,
  ylim,
  both.lwd,
  both.lty,
  col,
  ...
)

Arguments

`object`	`textmeta` object with strictly tokenized `text` component (character vectors) - e.g. a result of `cleanTexts`
`docs`	Object as a result of `LDAprep` which was handed over to `LDAgen`
`ldaresult`	The result of a function call `LDAgen` with `docs` as argument
`ldaID`	Character vector of IDs of the documents in `ldaresult`
`select`	List of integer vectors. Every list element is an 'or' link, every integer string in a vector is linked by the argument `link`. If `select` is only a `integer` vector it will be coerced to a list of the same length as the vector (see `as.list`), so that the argument `link` has no effect. Each integer vector as a list element represents one curve in the outcoming plot
`link`	Character: Should the (inner) integer vectors of each list element be linked by an `"and"` or an `"or"` (default: `"and"`)?
`wordlist`	List of character vectors: Which words - always linked by an "or" - should be taken into account for plotting the topic counts/proportion (default: the first `top.topic.words` per topic as simple character vector)?
`tnames`	Character vector of same length as `select` - labels for the topics (default are the first returned words of
`wnames`	Character vector of same length as `wordlist` - labels for every group of 'and' linked words `top.topic.words` from the `lda` package for each topic)
`rel`	Logical: Should counts (`FALSE`) or proportion (`TRUE`) be plotted (default: `FALSE`)?
`mark`	Logical: Should years be marked by vertical lines (default: `TRUE`)?
`unit`	Character: To which unit should dates be floored (default: `"month"`)? Other possible units are `"bimonth"`, `"quarter"`, `"season"`, `"halfyear"`, `"year"`, for more units see `round_date`
`curves`	Character: Should `"exact"`, `"smooth"` curve or `"both"` be plotted (default: `"exact"`)?
`smooth`	Numeric: Smoothing parameter which is handed over to `lowess` as `f` (default: `0.05`)
`legend`	Character: Value(s) to specify the legend coordinates (default: `"topright"`, `"onlyLast:topright"` for `pages = TRUE` respectively). If "none" no legend is plotted.
`pages`	Logical: Should all curves be plotted in a single plot (default: `FALSE`)? In addtion you could set `legend = "onlyLast:<argument>"` with `<argument>` as a character `legend` argument for only plotting a legend on the last plot of set.
`natozero`	Logical: Should NAs be coerced to zeros (default: `TRUE`)?
`file`	Character: File path if a pdf should be created
`main`	Character: Graphical parameter
`xlab`	Ccharacter: Graphical parameter
`ylab`	Character: Graphical parameter
`ylim`	Graphical parameter
`both.lwd`	Graphical parameter for smoothed values if `curves = "both"`
`both.lty`	Graphical parameter for smoothed values if `curves = "both"`
`col`	Graphical parameter, could be a vector. If `curves = "both"` the function will plot for every wordgroup the exact at first and then the smoothed curve - this is important for your col order.
`...`	Additional graphical parameters

Value

A plot. Invisible: A dataframe with columns date and tnames: wnames with the counts/proportion of the selected combination of topics and words.

Examples

## Not run: 
data(politics)
poliClean <- cleanTexts(politics)
words10 <- makeWordlist(text=poliClean$text)
words10 <- words10$words[words10$wordtable > 10]
poliLDA <- LDAprep(text=poliClean$text, vocab=words10)
LDAresult <- LDAgen(documents=poliLDA, K=10, vocab=words10)
plotWordpt(object=poliClean, docs=poliLDA, ldaresult=LDAresult, ldaID=names(poliLDA))
plotWordpt(object=poliClean, docs=poliLDA, ldaresult=LDAresult, ldaID=names(poliLDA), rel=TRUE)

# Differences between plotTopicWord and plotWordpt
par(mfrow=c(2,2))
plotTopicWord(object=poliClean, docs=poliLDA, ldaresult=LDAresult, ldaID=names(poliLDA),
              select=c(1,3,8), wordlist=c("bush"), rel=FALSE)
plotWordpt(object=poliClean, docs=poliLDA, ldaresult=LDAresult, ldaID=names(poliLDA),
           select=c(1,3,8), wordlist=c("bush"), rel=FALSE)
plotTopicWord(object=poliClean, docs=poliLDA, ldaresult=LDAresult, ldaID=names(poliLDA),
              select=c(1,3,8), wordlist=c("bush"), rel=TRUE)
plotWordpt(object=poliClean, docs=poliLDA, ldaresult=LDAresult, ldaID=names(poliLDA),
           select=c(1,3,8), wordlist=c("bush"), rel=TRUE)

## End(Not run)
## Not run: 
data(politics)
poliClean <- cleanTexts(politics)
words10 <- makeWordlist(text=poliClean$text)
words10 <- words10$words[words10$wordtable > 10]
poliLDA <- LDAprep(text=poliClean$text, vocab=words10)
LDAresult <- LDAgen(documents=poliLDA, K=10, vocab=words10)
plotWordpt(object=poliClean, docs=poliLDA, ldaresult=LDAresult, ldaID=names(poliLDA))
plotWordpt(object=poliClean, docs=poliLDA, ldaresult=LDAresult, ldaID=names(poliLDA), rel=TRUE)

# Differences between plotTopicWord and plotWordpt
par(mfrow=c(2,2))
plotTopicWord(object=poliClean, docs=poliLDA, ldaresult=LDAresult, ldaID=names(poliLDA),
              select=c(1,3,8), wordlist=c("bush"), rel=FALSE)
plotWordpt(object=poliClean, docs=poliLDA, ldaresult=LDAresult, ldaID=names(poliLDA),
           select=c(1,3,8), wordlist=c("bush"), rel=FALSE)
plotTopicWord(object=poliClean, docs=poliLDA, ldaresult=LDAresult, ldaID=names(poliLDA),
              select=c(1,3,8), wordlist=c("bush"), rel=TRUE)
plotWordpt(object=poliClean, docs=poliLDA, ldaresult=LDAresult, ldaID=names(poliLDA),
           select=c(1,3,8), wordlist=c("bush"), rel=TRUE)

## End(Not run)

Plotting Counts/Proportion of Words/Docs in LDA-generated Topic-Subcorpora over Time

Description

Creates a plot of the counts/proportion of words/docs in corpora which are generated by a ldaresult. Therefore an article is allocated to a topic - and then to the topics corpus - if there are enough (see limit and alloc) allocations of words in the article to the corresponding topic. Additionally the corpora are reduced by filterWord and a search-argument. The plot shows counts of subcorpora or if rel = TRUE proportion of subcorpora to its corresponding whole corpus.

Usage

plotWordSub(
  object,
  ldaresult,
  ldaID,
  limit = 10,
  alloc = c("multi", "unique", "best"),
  select = 1:nrow(ldaresult$document_sums),
  tnames,
  search,
  ignore.case = TRUE,
  type = c("docs", "words"),
  rel = TRUE,
  mark = TRUE,
  unit = "month",
  curves = c("exact", "smooth", "both"),
  smooth = 0.05,
  main,
  xlab,
  ylab,
  ylim,
  both.lwd,
  both.lty,
  col,
  legend = "topright",
  natozero = TRUE,
  file,
  ...
)
plotWordSub(
  object,
  ldaresult,
  ldaID,
  limit = 10,
  alloc = c("multi", "unique", "best"),
  select = 1:nrow(ldaresult$document_sums),
  tnames,
  search,
  ignore.case = TRUE,
  type = c("docs", "words"),
  rel = TRUE,
  mark = TRUE,
  unit = "month",
  curves = c("exact", "smooth", "both"),
  smooth = 0.05,
  main,
  xlab,
  ylab,
  ylim,
  both.lwd,
  both.lty,
  col,
  legend = "topright",
  natozero = TRUE,
  file,
  ...
)

Arguments

`object`	`textmeta` object with strictly tokenized `text` component (character vectors) - such as a result of `cleanTexts`
`ldaresult`	The result of a function call `LDAgen`
`ldaID`	Character vector of IDs of the documents in `ldaresult`
`limit`	Integer/numeric: How often a word must be allocated to a topic to count these article as belonging to this topic - if `0<limit<1` proportion is used (default: `10`)?
`alloc`	Character: Should every article be allocated to multiple topics (`"multi"`), or maximum one topic (`"unique"`), or the most represantative - exactly one - topic (`"best"`) (default: `"multi"`)? If `alloc = "best"` `limit` has no effect.
`select`	Integer vector: Which topics of `ldaresult` should be plotted (default: all topics)?
`tnames`	Character vector of same length as `select` - labels for the topics (default are the first returned words of `top.topic.words` from the `lda` package for each topic)
`search`	See `filterWord`
`ignore.case`	See `filterWord`
`type`	Character: Should counts/proportion of documents, where every `"docs"` or words `"words"` be plotted (default: `"docs"`)?
`rel`	Logical. Should counts (`FALSE`) or proportion (`TRUE`) be plotted (default: `TRUE`)?
`mark`	Logical: Should years be marked by vertical lines (default: `TRUE`)?
`unit`	Character: To which unit should dates be floored (default: `"month"`)? Other possible units are "bimonth", "quarter", "season", "halfyear", "year", for more units see `round_date`
`curves`	Character: Should `"exact"`, `"smooth"` curve or `"both"` be plotted (default: `"exact"`)?
`smooth`	Numeric: Smoothing parameter which is handed over to `lowess` as `f` (default: `0.05`)
`main`	Character: Graphical parameter
`xlab`	Character: Graphical parameter
`ylab`	Character: Graphical parameter
`ylim`	Graphical parameter (default if `rel = TRUE`: `c(0, 1)`)
`both.lwd`	Graphical parameter for smoothed values if `curves = "both"`
`both.lty`	Graphical parameter for smoothed values if `curves = "both"`
`col`	Graphical parameter, could be a vector. If `curves = "both"` the function will for every wordgroup plot at first the exact and then the smoothed curve - this is important for your col order.
`legend`	Character: Value(s) to specify the legend coordinates (default: "topright"). If "none" no legend is plotted.
`natozero`	Logical. Should NAs be coerced to zeros (default: `TRUE`)? Only has effect if `rel = TRUE`.
`file`	Character: File path if a pdf should be created
`...`	Additional graphical parameters

Value

A plot. Invisible: A dataframe with columns date and tnames with the counts/proportion of the selected topics.

Examples

## Not run: 
data(politics)
poliClean <- cleanTexts(politics)
poliPraesidents <- filterWord(object=poliClean, search=c("bush", "obama"))
words10 <- makeWordlist(text=poliPraesidents$text)
words10 <- words10$words[words10$wordtable > 10]
poliLDA <- LDAprep(text=poliPraesidents$text, vocab=words10)
LDAresult <- LDAgen(documents=poliLDA, K=5, vocab=words10)
plotWordSub(object=poliClean, ldaresult=LDAresult, ldaID=names(poliLDA), search="obama")

## End(Not run)
## Not run: 
data(politics)
poliClean <- cleanTexts(politics)
poliPraesidents <- filterWord(object=poliClean, search=c("bush", "obama"))
words10 <- makeWordlist(text=poliPraesidents$text)
words10 <- words10$words[words10$wordtable > 10]
poliLDA <- LDAprep(text=poliPraesidents$text, vocab=words10)
LDAresult <- LDAgen(documents=poliLDA, K=5, vocab=words10)
plotWordSub(object=poliClean, ldaresult=LDAresult, ldaID=names(poliLDA), search="obama")

## End(Not run)

Precision and Recall

Description

Estimates Precision and Recall for sampling in different intersections

Usage

precision(w, p, subset)

vprecision(w, p, subset, n)

recall(w, p, subset)

vrecall(w, p, subset, n)
precision(w, p, subset)

vprecision(w, p, subset, n)

recall(w, p, subset)

vrecall(w, p, subset, n)

Arguments

`w`	Numeric vector: Each entry represents one intersection. Proportion of texts in this intersection.
`p`	Numeric vector: Each entry represents one intersection. Proportion of relevant texts in this intersection.
`subset`	Logical vector: Each entry represents one intersection. Controls if the intersection belongs to the subcorpus of interest or not.
`n`	Integer vector: Number of Texts labeled in the corresponding intersection.

Value

Estimator for precision, recall, and their variances respectively.

Examples

w <- c(0.5, 0.1, 0.2, 0.2)
p <- c(0.01, 0.8, 0.75, 0.95)
subset <- c(FALSE, TRUE, FALSE, TRUE)
n <- c(40, 20, 15, 33)
precision(w, p, subset)
vprecision(w, p, subset, n)
recall(w, p, subset)
vrecall(w, p, subset, n)

w <- c(0.5, 0.1, 0.2, 0.2)
p <- c(0.01, 0.8, 0.75, 0.95)
subset <- c(FALSE, TRUE, FALSE, TRUE)
n <- c(40, 20, 15, 33)
precision(w, p, subset)
vprecision(w, p, subset, n)
recall(w, p, subset)
vrecall(w, p, subset, n)

Read Corpora as CSV

Description

Reads CSV-files and seperates the text and meta data. The result is a textmeta object.

Usage

readTextmeta(
  path,
  file,
  cols,
  dateFormat = "%Y-%m-%d",
  idCol = "id",
  dateCol = "date",
  titleCol = "title",
  textCol = "text",
  encoding = "UTF-8",
  xmlAction = TRUE,
  duplicateAction = TRUE
)

readTextmeta.df(
  df,
  cols = colnames(df),
  dateFormat = "%Y-%m-%d",
  idCol = "id",
  dateCol = "date",
  titleCol = "title",
  textCol = "text",
  xmlAction = TRUE,
  duplicateAction = TRUE
)
readTextmeta(
  path,
  file,
  cols,
  dateFormat = "%Y-%m-%d",
  idCol = "id",
  dateCol = "date",
  titleCol = "title",
  textCol = "text",
  encoding = "UTF-8",
  xmlAction = TRUE,
  duplicateAction = TRUE
)

readTextmeta.df(
  df,
  cols = colnames(df),
  dateFormat = "%Y-%m-%d",
  idCol = "id",
  dateCol = "date",
  titleCol = "title",
  textCol = "text",
  xmlAction = TRUE,
  duplicateAction = TRUE
)

Arguments

`path`	`character/data.frame` string with path where the data files are OR parameter `df` for `readTextmeta.df`
`file`	`character` string with names of the CSV files
`cols`	`character` vector with columns which should be kept
`dateFormat`	`character` string with the date format in the files for `as.Date`
`idCol`	`character` string with column name of the IDs
`dateCol`	`character` string with column name of the Dates
`titleCol`	`character` string with column name of the Titles
`textCol`	`character` string with column name of the Texts
`encoding`	character string with encoding specification of the files
`xmlAction`	`logical` whether all columns of the CSV should be handled with `removeXML`
`duplicateAction`	`logical` whether `deleteAndRenameDuplicates` should be applied to the created `textmeta` object
`df`	`data.frame` table which should be transformed to a textmeta object

Value

textmeta object

Read WhatsApp files

Description

Reads HTML-files from WhatsApp and separates the text and meta data.

Usage

readWhatsApp(path, file)
readWhatsApp(path, file)

Arguments

`path`	Character: string with path where the data files are. If only `path` is given, `file` will be determined by searching for html files with `list.files` and recursion.
`file`	Character: string with names of the HTML files.

Value

textmeta object.

Author(s)

Jonas Rieger (<[email protected]>)

Read Pages from Wikipedia

Description

Downloads pages from Wikipedia and extracts some meta information with functions from the package WikipediR. Creates a textmeta object including the requested pages.

Usage

readWiki(
  category,
  subcategories = TRUE,
  language = "en",
  project = "wikipedia"
)
readWiki(
  category,
  subcategories = TRUE,
  language = "en",
  project = "wikipedia"
)

Arguments

`category`	`character` articles of which category should be downloaded, see `pages_in_category`, argument `categories`
`subcategories`	`logical` (default: `TRUE`) should subcategories be downloaded as well
`language`	`character` (default: `"en"`), see `pages_in_category`
`project`	`character` (default: `"wikipedia"`), see `pages_in_category`

Value

textmeta object

Examples

## Not run: corpus <- readWiki(category="Person_(Studentenbewegung)",
subcategories = FALSE, language = "de", project = "wikipedia")
## End(Not run)
## Not run: corpus <- readWiki(category="Person_(Studentenbewegung)",
subcategories = FALSE, language = "de", project = "wikipedia")
## End(Not run)

Read files from Wikinews

Description

Reads the XML-files from the Wikinews export page https://en.wikinews.org/wiki/Special:Export.

Usage

readWikinews(
  path = getwd(),
  file = list.files(path = path, pattern = "*.xml$", full.names = FALSE, recursive =
    TRUE)
)
readWikinews(
  path = getwd(),
  file = list.files(path = path, pattern = "*.xml$", full.names = FALSE, recursive =
    TRUE)
)

Arguments

`path`	Path where the data files are.
`file`	Character string with names of the HTML files.

Value

textmeta-object

Removes XML/HTML Tags and Umlauts

Description

Removes XML tags (removeXML), remove or resolve HTML tags (removeHTML) and changes german umlauts in a standardized form (removeUmlauts).

Usage

removeXML(x)

removeUmlauts(x)

removeHTML(
  x,
  dec = TRUE,
  hex = TRUE,
  entity = TRUE,
  symbolList = c(1:4, 9, 13, 15, 16),
  delete = TRUE,
  symbols = FALSE
)
removeXML(x)

removeUmlauts(x)

removeHTML(
  x,
  dec = TRUE,
  hex = TRUE,
  entity = TRUE,
  symbolList = c(1:4, 9, 13, 15, 16),
  delete = TRUE,
  symbols = FALSE
)

Arguments

`x`	Character: Vector or list of character vectors.
`dec`	Logical: If `TRUE` HTML-entities in decimal-style would be resolved.
`hex`	Logical: If `TRUE` HTML-entities in hexadecimal-style would be resolved.
`entity`	Logical: If `TRUE` HTML-entities in text-style would be resolved.
`symbolList`	numeric vector to chhose from the 16 ISO-8859 Lists (ISO-8859 12 did not exists and is empty).
`delete`	Logical: If `TRUE` all not resolved HTML-entities would bei deleted?
`symbols`	Logical: If `TRUE` most symbols from ISO-8859 would be not resolved (DEC: 32:64, 91:96, 123:126, 160:191, 215, 247, 818, 8194:8222, 8254, 8291, 8364, 8417, 8470).

Details

The decision which u.type is used should consider the language of the corpus, because in some languages the replacement of umlauts can change the meaning of a word. To change which columns are used by removeXML use argument xmlAction in readTextmeta.

Value

Adjusted character string or list, depending on input.

Examples

xml <- "<text>Some <b>important</b> text</text>"
removeXML(xml)

x <- "&#x00f8; &#248; &oslash;"
removeHTML(x=x, symbolList = 1, dec=TRUE, hex=FALSE, entity=FALSE, delete = FALSE)
removeHTML(x=x, symbolList = c(1,3))

y <- c("Bl\UFChende Apfelb\UE4ume")
removeUmlauts(y)

xml <- "<text>Some <b>important</b> text</text>"
removeXML(xml)

x <- "&#x00f8; &#248; &oslash;"
removeHTML(x=x, symbolList = 1, dec=TRUE, hex=FALSE, entity=FALSE, delete = FALSE)
removeHTML(x=x, symbolList = c(1,3))

y <- c("Bl\UFChende Apfelb\UE4ume")
removeUmlauts(y)

Sample Texts

Description

Sample texts from different subsets to minimize variance of the recall estimator

Usage

sampling(id, corporaID, label, m, randomize = FALSE, exact = FALSE)
sampling(id, corporaID, label, m, randomize = FALSE, exact = FALSE)

Arguments

`id`	Character: IDs of all texts in the corpus.
`corporaID`	List of Character: Each list element is a character vector and contains the IDs belonging to one subcorpus. Each ID has to be in `id`.
`label`	Named Logical: Labeling result for already labeled texts. Could be empty, if no labeled data exists. The algorithm sets `p = 0.5` for all intersections. Names have to be `id`.
`m`	Integer: Number of new samples.
`randomize`	Logical: If `TRUE` calculated split is used as parameter to draw from a multinomial distribution.
`exact`	Logical: If `TRUE` exact calculation is used. For the default `FALSE` an approximation is used.

Value

Character vector of IDs, which should be labeled next.

Examples

id <- paste0("ID", 1:1000)
corporaID <- list(sample(id, 300), sample(id, 100), sample(id, 700))
label <- sample(as.logical(0:1), 150, replace=TRUE)
names(label) <- c(sample(id, 100), sample(corporaID[[2]], 50))
m <- 100
sampling(id, corporaID, label, m)
id <- paste0("ID", 1:1000)
corporaID <- list(sample(id, 300), sample(id, 100), sample(id, 700))
label <- sample(as.logical(0:1), 150, replace=TRUE)
names(label) <- c(sample(id, 100), sample(corporaID[[2]], 50))
m <- 100
sampling(id, corporaID, label, m)

Export Readable Meta-Data of Articles.

Description

Exports requested meta-data of articles for given id's.

Usage

showMeta(
  meta,
  id = meta$id,
  cols = colnames(meta),
  file,
  fileEncoding = "UTF-8"
)
showMeta(
  meta,
  id = meta$id,
  cols = colnames(meta),
  file,
  fileEncoding = "UTF-8"
)

Arguments

`meta`	A data.frame of meta-data as a result of a read-function.
`id`	Character vector or matrix including article ids.
`cols`	Character vector including the requested columns of meta.
`file`	Character Filename for the export.
`fileEncoding`	character string: declares file encoding. For more information see `write.csv`

Value

A list of the requested meta data. If file is set, writes a csv including the meta-data of the requested meta data.

Examples

meta <- data.frame(id=c("A", "B", "C", "D"),
title=c("Fishing", "Don't panic!", "Sir Ronald", "Berlin"),
date=c("1885-01-02", "1979-03-04", "1951-05-06", "1967-06-02"),
additionalVariable=1:4, stringsAsFactors=FALSE)

extractedMeta <- showMeta(meta=meta, cols = c("title", "date"))

meta <- data.frame(id=c("A", "B", "C", "D"),
title=c("Fishing", "Don't panic!", "Sir Ronald", "Berlin"),
date=c("1885-01-02", "1979-03-04", "1951-05-06", "1967-06-02"),
additionalVariable=1:4, stringsAsFactors=FALSE)

extractedMeta <- showMeta(meta=meta, cols = c("title", "date"))

Exports Readable Text Lists

Description

Exports the article id, text, title and date.

Usage

showTexts(object, id = names(object$text), file, fileEncoding = "UTF-8")
showTexts(object, id = names(object$text), file, fileEncoding = "UTF-8")

Arguments

`object`	`textmeta` object
`id`	Character vector or matrix including article ids
`file`	Character Filename for the export. If not specified the functions output ist only invisible.
`fileEncoding`	character string: declares file encoding. For more information see `write.csv`

Value

A list of the requested articles. If file is set, writes a csv including the meta-data of the requested articles.

Examples

texts <- list(A="Give a Man a Fish, and You Feed Him for a Day.
Teach a Man To Fish, and You Feed Him for a Lifetime",
B="So Long, and Thanks for All the Fish",
C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.")

corpus <- textmeta(meta=data.frame(id=c("A", "B", "C", "D"),
title=c("Fishing", "Don't panic!", "Sir Ronald", "Berlin"),
date=c("1885-01-02", "1979-03-04", "1951-05-06", "1967-06-02"),
additionalVariable=1:4, stringsAsFactors=FALSE), text=texts)

exportedTexts <- showTexts(object=corpus, id = c("A","C"))
texts <- list(A="Give a Man a Fish, and You Feed Him for a Day.
Teach a Man To Fish, and You Feed Him for a Lifetime",
B="So Long, and Thanks for All the Fish",
C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.")

corpus <- textmeta(meta=data.frame(id=c("A", "B", "C", "D"),
title=c("Fishing", "Don't panic!", "Sir Ronald", "Berlin"),
date=c("1885-01-02", "1979-03-04", "1951-05-06", "1967-06-02"),
additionalVariable=1:4, stringsAsFactors=FALSE), text=texts)

exportedTexts <- showTexts(object=corpus, id = c("A","C"))

"textmeta"-Objects

Description

Creates, Tests, Summarises and Plots Textmeta-Objects

Usage

textmeta(meta = NULL, text = NULL, metamult = NULL, dateFormat = "%Y-%m-%d")

is.textmeta(x)

## S3 method for class 'textmeta'
print(x, ...)

## S3 method for class 'textmeta'
summary(object, listnames = names(object), metavariables = character(), ...)

## S3 method for class 'textmeta'
plot(x, ...)
textmeta(meta = NULL, text = NULL, metamult = NULL, dateFormat = "%Y-%m-%d")

is.textmeta(x)

## S3 method for class 'textmeta'
print(x, ...)

## S3 method for class 'textmeta'
summary(object, listnames = names(object), metavariables = character(), ...)

## S3 method for class 'textmeta'
plot(x, ...)

Arguments

`meta`	Data.frame (or matrix) of the meta-data, e.g. as received from `as.meta`
`text`	Named list (or character vector) of the text-data (names should correspond to IDs in meta)
`metamult`	List of the metamult-data
`dateFormat`	Charachter string with the date format in meta for `as.Date`
`x`	an R Object.
`...`	further arguments in plot. Not implemented for print and summary.
`object`	textmeta object
`listnames`	Character vector with names of textmeta lists (meta, text, metamult). Summaries are generated for those lists only. Default gives summaries for all lists.
`metavariables`	Character vector with variable-names from the meta dataset. Summaries are generated for those variables only.

Value

A textmeta object.

Examples

texts <- list(A="Give a Man a Fish, and You Feed Him for a Day.
Teach a Man To Fish, and You Feed Him for a Lifetime",
B="So Long, and Thanks for All the Fish",
C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.")

corpus <- textmeta(meta=data.frame(id=c("A", "B", "C", "D"),
title=c("Fishing", "Don't panic!", "Sir Ronald", "Berlin"),
date=c("1885-01-02", "1979-03-04", "1951-05-06", "1967-06-02"),
additionalVariable=1:4, stringsAsFactors=FALSE), text=texts)

print(corpus)
summary(corpus)
str(corpus)

texts <- list(A="Give a Man a Fish, and You Feed Him for a Day.
Teach a Man To Fish, and You Feed Him for a Lifetime",
B="So Long, and Thanks for All the Fish",
C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.")

corpus <- textmeta(meta=data.frame(id=c("A", "B", "C", "D"),
title=c("Fishing", "Don't panic!", "Sir Ronald", "Berlin"),
date=c("1885-01-02", "1979-03-04", "1951-05-06", "1967-06-02"),
additionalVariable=1:4, stringsAsFactors=FALSE), text=texts)

print(corpus)
summary(corpus)
str(corpus)

Transform textmeta to an object with tidy text data

Description

Transfers data from a text component of a textmeta object to a tidy data.frame.

Usage

tidy.textmeta(object)

is.textmeta_tidy(x)

## S3 method for class 'textmeta_tidy'
print(x, ...)
tidy.textmeta(object)

is.textmeta_tidy(x)

## S3 method for class 'textmeta_tidy'
print(x, ...)

Arguments

`object`	A `textmeta` object
`x`	an R Object.
`...`	further arguments passed to or from other methods.

Value

An object with tidy text data

Examples

texts <- list(A="Give a Man a Fish, and You Feed Him for a Day.
 Teach a Man To Fish, and You Feed Him for a Lifetime",
 B="So Long, and Thanks for All the Fish",
 C="A very able manipulative mathematician, Fisher enjoys a real mastery
 in evaluating complicated multiple integrals.")

obj <- textmeta(meta=data.frame(id=c("A", "B", "C", "D"),
 title=c("Fishing", "Don't panic!", "Sir Ronald", "Berlin"),
 date=c("1885-01-02", "1979-03-04", "1951-05-06", "1967-06-02"),
 additionalVariable=1:4, stringsAsFactors=FALSE), text=texts)

tidy.textmeta(obj)

obj <- cleanTexts(obj)
tidy.textmeta(obj)
texts <- list(A="Give a Man a Fish, and You Feed Him for a Day.
 Teach a Man To Fish, and You Feed Him for a Lifetime",
 B="So Long, and Thanks for All the Fish",
 C="A very able manipulative mathematician, Fisher enjoys a real mastery
 in evaluating complicated multiple integrals.")

obj <- textmeta(meta=data.frame(id=c("A", "B", "C", "D"),
 title=c("Fishing", "Don't panic!", "Sir Ronald", "Berlin"),
 date=c("1885-01-02", "1979-03-04", "1951-05-06", "1967-06-02"),
 additionalVariable=1:4, stringsAsFactors=FALSE), text=texts)

tidy.textmeta(obj)

obj <- cleanTexts(obj)
tidy.textmeta(obj)

Calculating Topic Coherence

Description

Implementationof Mimno's topic coherence.

Usage

topicCoherence(
  ldaresult,
  documents,
  num.words = 10,
  by.score = TRUE,
  sym.coherence = FALSE,
  epsilon = 1
)
topicCoherence(
  ldaresult,
  documents,
  num.words = 10,
  by.score = TRUE,
  sym.coherence = FALSE,
  epsilon = 1
)

Arguments

`ldaresult`	The result of a function call `LDAgen`
`documents`	A list prepared by `LDAprep`.
`num.words`	Integer: Number of topwords used for calculating topic coherence (default: `10`).
`by.score`	Logical: Should the Score from `top.topic.words` be used (default: `TRUE`)?
`sym.coherence`	Logical: Should a symmetric version of the topic coherence used for the calculations? If TRUE the denominator of the topic coherence uses both wordcounts and not just one.
`epsilon`	Numeric: Smoothing factor to avoid log(0). Default is 1. Stevens et al. recommend a smaller value.

Value

A vector of topic coherences. the length of the vector corresponds to the number of topics in the model.

References

Mimno, David and Wallach, Hannah M. and Talley, Edmund and Leenders, Miriam and McCallum, Andrew. Optimizing semantic coherence in topic models. EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2011. Stevens, Keith and Andrzejewski, David and Buttler, David. Exploring topic coherence over many models and many topics. EMNLP-CoNLL '12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2012.

Examples

texts <- list(A="Give a Man a Fish, and You Feed Him for a Day.
Teach a Man To Fish, and You Feed Him for a Lifetime",
B="So Long, and Thanks for All the Fish",
C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.")

corpus <- textmeta(meta=data.frame(id=c("A", "B", "C", "D"),
title=c("Fishing", "Don't panic!", "Sir Ronald", "Berlin"),
date=c("1885-01-02", "1979-03-04", "1951-05-06", "1967-06-02"),
additionalVariable=1:4, stringsAsFactors=FALSE), text=texts)

corpus <- cleanTexts(corpus)
wordlist <- makeWordlist(corpus$text)
ldaPrep <- LDAprep(text=corpus$text, vocab=wordlist$words)

result <- LDAgen(documents=ldaPrep, K = 3L, vocab=wordlist$words, num.words=3)
topicCoherence(ldaresult=result, documents=ldaPrep, num.words=5, by.score=TRUE)
texts <- list(A="Give a Man a Fish, and You Feed Him for a Day.
Teach a Man To Fish, and You Feed Him for a Lifetime",
B="So Long, and Thanks for All the Fish",
C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.")

corpus <- textmeta(meta=data.frame(id=c("A", "B", "C", "D"),
title=c("Fishing", "Don't panic!", "Sir Ronald", "Berlin"),
date=c("1885-01-02", "1979-03-04", "1951-05-06", "1967-06-02"),
additionalVariable=1:4, stringsAsFactors=FALSE), text=texts)

corpus <- cleanTexts(corpus)
wordlist <- makeWordlist(corpus$text)
ldaPrep <- LDAprep(text=corpus$text, vocab=wordlist$words)

result <- LDAgen(documents=ldaPrep, K = 3L, vocab=wordlist$words, num.words=3)
topicCoherence(ldaresult=result, documents=ldaPrep, num.words=5, by.score=TRUE)

Coloring the words of a text corresponding to topic allocation

Description

The function creates a HTML document with the words of texts colored depending on the topic allocation of each word.

Usage

topicsInText(
  text,
  ldaID,
  id,
  ldaresult,
  label = NULL,
  vocab,
  wordOrder = c("both", "alphabetical", "topics", ""),
  colors = NULL,
  fixColors = FALSE,
  meta = NULL,
  originaltext = NULL,
  unclearTopicAssignment = TRUE,
  htmlreturn = FALSE
)
topicsInText(
  text,
  ldaID,
  id,
  ldaresult,
  label = NULL,
  vocab,
  wordOrder = c("both", "alphabetical", "topics", ""),
  colors = NULL,
  fixColors = FALSE,
  meta = NULL,
  originaltext = NULL,
  unclearTopicAssignment = TRUE,
  htmlreturn = FALSE
)

Arguments

`text`	The result of `LDAprep`
`ldaID`	List of IDs for `text`
`id`	ID of the article of interest
`ldaresult`	A result object from the `standardLDA`
`label`	Optional label for each topic
`vocab`	Character: Vector of `vocab` corresponding to the `text` object
`wordOrder`	Type of output: `"alphabetical"` prints the words of the article in alphabetical order, `"topics"` sorts by topic (biggest topic first) and `"both"` prints both versions. All other inputs will result to no output (this makes only sense in combination with `originaltext`.
`colors`	Character vector of colors. If the vector is shorter than the number of topics it will be completed by "black" entrys.
`fixColors`	Logical: If `FALSE` the first color will be used for the biggest topic and so on. If `fixColors=TRUE` the the color-entry corresponding to the position of the topic is choosen.
`meta`	Optional input for meta data. It will be printed in the header of the output.
`originaltext`	Optional a list of texts (the `text` list of the `textmeta` object) including the desired text. Listnames must be IDs. Necessary for output in original text
`unclearTopicAssignment`	Logical: If TRUE all words which are assigned to more than one topic will not be colored. Otherwise the words will be colored in order of topic apperance in the `ldaresult`.
`htmlreturn`	Logical: HTML output for tests

Value

A HTML document

Examples

## Not run: 
data(politics)
poliClean <- cleanTexts(politics)
words10 <- makeWordlist(text=poliClean$text)
words10 <- words10$words[words10$wordtable > 10]
poliLDA <- LDAprep(text=poliClean$text, vocab=words10)
LDAresult <- LDAgen(documents=poliLDA, K=10, vocab=words10)
topicsInText(text=politics$text, ldaID=names(poliLDA), id="ID2756",
             ldaresult=LDAresult, vocab=words10)
## End(Not run)
## Not run: 
data(politics)
poliClean <- cleanTexts(politics)
words10 <- makeWordlist(text=poliClean$text)
words10 <- words10$words[words10$wordtable > 10]
poliLDA <- LDAprep(text=poliClean$text, vocab=words10)
LDAresult <- LDAgen(documents=poliLDA, K=10, vocab=words10)
topicsInText(text=politics$text, ldaID=names(poliLDA), id="ID2756",
             ldaresult=LDAresult, vocab=words10)
## End(Not run)

Get The IDs Of The Most Representive Texts

Description

The function extracts the text IDs belonging to the texts with the highest relative or absolute number of words per topic.

Usage

topTexts(
  ldaresult,
  ldaID,
  limit = 20L,
  rel = TRUE,
  select = 1:nrow(ldaresult$document_sums),
  tnames,
  minlength = 30L
)
topTexts(
  ldaresult,
  ldaID,
  limit = 20L,
  rel = TRUE,
  select = 1:nrow(ldaresult$document_sums),
  tnames,
  minlength = 30L
)

Arguments

`ldaresult`	LDA result
`ldaID`	Vector of text IDs
`limit`	Integer: Number of text IDs per topic.
`rel`	Logical: Should be the relative frequency be used?
`select`	Which topics should be returned?
`tnames`	Names of the selected topics
`minlength`	Minimal total number of words a text must have to be included

Value

Matrix of text IDs.

Examples

texts <- list(A="Give a Man a Fish, and You Feed Him for a Day.
Teach a Man To Fish, and You Feed Him for a Lifetime",
B="So Long, and Thanks for All the Fish",
C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.")

corpus <- textmeta(meta=data.frame(id=c("A", "B", "C", "D"),
title=c("Fishing", "Don't panic!", "Sir Ronald", "Berlin"),
date=c("1885-01-02", "1979-03-04", "1951-05-06", "1967-06-02"),
additionalVariable=1:4, stringsAsFactors=FALSE), text=texts)

corpus <- cleanTexts(corpus)
wordlist <- makeWordlist(corpus$text)
ldaPrep <- LDAprep(text=corpus$text, vocab=wordlist$words)

LDA <- LDAgen(documents=ldaPrep, K = 3L, vocab=wordlist$words, num.words=3)
topTexts(ldaresult=LDA, ldaID=c("A","B","C"), limit = 1L, minlength=2)
texts <- list(A="Give a Man a Fish, and You Feed Him for a Day.
Teach a Man To Fish, and You Feed Him for a Lifetime",
B="So Long, and Thanks for All the Fish",
C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.")

corpus <- textmeta(meta=data.frame(id=c("A", "B", "C", "D"),
title=c("Fishing", "Don't panic!", "Sir Ronald", "Berlin"),
date=c("1885-01-02", "1979-03-04", "1951-05-06", "1967-06-02"),
additionalVariable=1:4, stringsAsFactors=FALSE), text=texts)

corpus <- cleanTexts(corpus)
wordlist <- makeWordlist(corpus$text)
ldaPrep <- LDAprep(text=corpus$text, vocab=wordlist$words)

LDA <- LDAgen(documents=ldaPrep, K = 3L, vocab=wordlist$words, num.words=3)
topTexts(ldaresult=LDA, ldaID=c("A","B","C"), limit = 1L, minlength=2)

`topics`	`named matrix`: The counts of vocabularies (column wise) in topics (row wise).
`numWords`	`integer(1)`: The number of requested top words per topic.
`byScore`	`logical(1)`: Should the values that are taken for determining the top words per topic be calculated by the function `importance` (`TRUE`) or should the absolute counts be considered (`FALSE`)?
`epsilon`	`numeric(1)`: Small number to add to logarithmic calculations to overcome the issue of determining `log(0)`.
`values`	`logical(1)`: Should the values that are taken for determining the top words per topic be returned?