pyate package
Submodules
pyate.combo_basic module
pyate.cvalues module
pyate.term_extraction module
- class pyate.term_extraction.TermExtraction(corpus: Union[str, Iterable[str]], vocab: Sequence[str] = None, patterns=[[{'POS': 'ADJ', 'IS_PUNCT': False}], [{'POS': {'IN': ['ADJ', 'NOUN']}, 'OP': '*', 'IS_PUNCT': False}, {'POS': 'NOUN', 'IS_PUNCT': False}], [{'POS': {'IN': ['ADJ', 'NOUN']}, 'OP': '*', 'IS_PUNCT': False}, {'POS': 'NOUN', 'IS_PUNCT': False}, {'POS': 'DET', 'IS_PUNCT': False}, {'POS': {'IN': ['ADJ', 'NOUN']}, 'OP': '*', 'IS_PUNCT': False}, {'POS': 'NOUN', 'IS_PUNCT': False}]], do_parallelize: bool = True, language='en', nlp=None, default_domain=None, default_domain_size: int = None, max_word_length: int = None, dtype: numpy.dtype = None)

  Bases: object
- DEFAULT_GENERAL_DOMAINS = {}
- adj = {'IS_PUNCT': False, 'POS': 'ADJ'}
- basic(*args, **kwargs)
- combo_basic(*args, **kwargs)
- config = {'DEFAULT_GENERAL_DOMAIN_SIZE': 300, 'MAX_WORD_LENGTH': 6, 'dtype': <class 'numpy.int16'>, 'language': 'en', 'spacy_model': 'en_core_web_sm'}
- static configure(new_settings: Dict[str, Any])

  Updates config settings, which include:

  - spacy_model: str = "en_core_web_sm" (the name of the spaCy model to use)
  - language: str = "en" (the default language)
  - MAX_WORD_LENGTH: int = 6 (the maximum number of words for a candidate to be considered a phrase)
  - DEFAULT_GENERAL_DOMAIN_SIZE: int = 300 (the number of sentences to take from the general-domain file)
  - dtype: np.int16 (the data type of the pandas Series used as counters)
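The merge semantics can be sketched as a shallow update of a class-level config dict. This is a simplified, hypothetical illustration (class name and defaults mirror the documented config values, but it is not pyate's actual source):

```python
# Sketch: a class-level config dict updated in place by a static method,
# mirroring the documented behaviour of TermExtraction.configure.
from typing import Any, Dict


class ExtractorSketch:
    # Defaults copied from the documented config attribute.
    config: Dict[str, Any] = {
        "spacy_model": "en_core_web_sm",
        "language": "en",
        "MAX_WORD_LENGTH": 6,
        "DEFAULT_GENERAL_DOMAIN_SIZE": 300,
    }

    @staticmethod
    def configure(new_settings: Dict[str, Any]) -> None:
        # Overwrite only the keys the caller supplies; leave the rest intact.
        ExtractorSketch.config.update(new_settings)


ExtractorSketch.configure({"language": "it", "MAX_WORD_LENGTH": 4})
print(ExtractorSketch.config["language"])         # it
print(ExtractorSketch.config["MAX_WORD_LENGTH"])  # 4
```

Because the config is class-level, a call to configure affects every subsequent extraction, not just one instance.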
- count_terms_from_document(document: str)

  Counts the frequency of each term in the document and returns a defaultdict mapping each phrase in vocab to its number of occurrences.
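The return shape can be illustrated with a minimal phrase counter. This is a naive sketch (pyate itself matches phrases with an Aho-Corasick automaton, per the trie attribute below):

```python
# Sketch of counting vocab phrases in one document, returning a defaultdict
# as count_terms_from_document does. Naive substring search, for illustration.
from collections import defaultdict
from typing import DefaultDict, Sequence


def count_phrases(document: str, vocab: Sequence[str]) -> DefaultDict[str, int]:
    text = document.lower()
    counts: DefaultDict[str, int] = defaultdict(int)
    for phrase in vocab:
        needle = phrase.lower()
        start = 0
        while True:
            idx = text.find(needle, start)
            if idx == -1:
                break
            counts[phrase] += 1
            start = idx + 1  # allow overlapping occurrences
    return counts


doc = "Neural networks and deep neural networks share layers."
print(count_phrases(doc, ["neural networks", "layers"]))
# defaultdict(<class 'int'>, {'neural networks': 2, 'layers': 1})
```

Because the result is a defaultdict, looking up a vocab phrase that never occurred simply yields 0.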
- count_terms_from_documents(seperate: bool = False, verbose: bool = False)

  The main purpose of this class. Counts terms from the documents and returns a pandas Series. If self.corpus is a string, this is identical to count_terms_from_document. If the corpus is an iterable (more specifically, a collections.abc.Iterable) of strings, the same counting is performed for each string in the iterable. If seperate is set to True, it returns an iterable of defaultdicts; otherwise, it returns a single defaultdict with the frequencies summed across all strings.
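The effect of the seperate flag can be sketched as follows. The helper name and counting logic are hypothetical; only the per-document versus merged return shape mirrors the documented behaviour:

```python
# Sketch of per-document vs. merged term counts, mirroring the seperate flag.
from collections import Counter
from typing import Iterable, List, Union


def count_terms(corpus: Union[str, Iterable[str]],
                vocab: List[str],
                separate: bool = False):
    if isinstance(corpus, str):
        corpus = [corpus]  # a single string behaves like a one-document corpus
    per_doc = [
        Counter({term: doc.lower().count(term.lower()) for term in vocab})
        for doc in corpus
    ]
    if separate:
        return per_doc  # one counter per document
    merged = Counter()
    for counts in per_doc:
        merged += counts  # summing counters merges the frequencies
    return merged


docs = ["term extraction is fun", "automatic term extraction"]
print(count_terms(docs, ["term extraction"], separate=True))
print(count_terms(docs, ["term extraction"]))
```

The merged form is what the scoring methods need; the separate form is useful when per-document frequencies matter.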
- cvalues(*args, **kwargs)
- static get_general_domain(language: str = None, size: int = None)

  Returns a pandas Series of text from the general-domain corpus.
- nlps = {}
- noun = {'IS_PUNCT': False, 'POS': 'NOUN'}
- patterns = [[{'POS': 'ADJ', 'IS_PUNCT': False}], [{'POS': {'IN': ['ADJ', 'NOUN']}, 'OP': '*', 'IS_PUNCT': False}, {'POS': 'NOUN', 'IS_PUNCT': False}], [{'POS': {'IN': ['ADJ', 'NOUN']}, 'OP': '*', 'IS_PUNCT': False}, {'POS': 'NOUN', 'IS_PUNCT': False}, {'POS': 'DET', 'IS_PUNCT': False}, {'POS': {'IN': ['ADJ', 'NOUN']}, 'OP': '*', 'IS_PUNCT': False}, {'POS': 'NOUN', 'IS_PUNCT': False}]]
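The second pattern, for example, matches zero or more adjectives or nouns followed by a noun. Its behaviour over pre-tagged tokens can be simulated with a regular expression over POS codes. This is a rough sketch: pyate applies these patterns with spaCy's Matcher, which also reports overlapping sub-matches, while this greedy version keeps only the longest span:

```python
# Sketch: simulate the pattern [{'POS': {'IN': ['ADJ', 'NOUN']}, 'OP': '*'},
# {'POS': 'NOUN'}] as the regex (ADJ|NOUN)* NOUN over one-character POS codes.
import re
from typing import List, Tuple


def match_noun_phrases(tagged: List[Tuple[str, str]]) -> List[str]:
    # Encode each token's coarse POS as one character: A=ADJ, N=NOUN, O=other.
    code = "".join(
        "A" if pos == "ADJ" else "N" if pos == "NOUN" else "O"
        for _, pos in tagged
    )
    spans = []
    for m in re.finditer(r"[AN]*N", code):
        spans.append(" ".join(tok for tok, _ in tagged[m.start():m.end()]))
    return spans


tokens = [("deep", "ADJ"), ("neural", "ADJ"), ("network", "NOUN"),
          ("learns", "VERB"), ("features", "NOUN")]
print(match_noun_phrases(tokens))  # ['deep neural network', 'features']
```

The third pattern extends this shape with a determiner in the middle, capturing candidates such as "state of the art" style noun-det-noun constructions.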
- prep = {'IS_PUNCT': False, 'POS': 'DET'}
- static set_language(language: str, model_name: str = None)

  For changing the language. Currently, the DEFAULT_GENERAL_DOMAIN data is available in English and Italian only. If you have a good dataset in another language, please open an issue on GitHub.
- term_extractor(*args, **kwargs)
- trie

  Returns an automaton built with the Aho–Corasick algorithm via the pyahocorasick library (https://pypi.org/project/pyahocorasick/). This method builds the automaton the first time and caches it for future use.
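The build-once-then-cache behaviour can be sketched with a plain dict-based trie as a simplified stand-in for the pyahocorasick automaton (class and method names here are hypothetical):

```python
# Sketch: lazily built, cached character trie over the vocab, standing in
# for the cached Aho-Corasick automaton behind TermExtraction.trie.
from typing import Dict, List, Optional


class TrieSketch:
    def __init__(self, vocab: List[str]):
        self.vocab = vocab
        self._trie: Optional[Dict] = None  # built lazily on first access

    @property
    def trie(self) -> Dict:
        if self._trie is None:
            root: Dict = {}
            for word in self.vocab:
                node = root
                for ch in word:
                    node = node.setdefault(ch, {})
                node["$"] = word  # end-of-term marker
            self._trie = root  # cache for all later accesses
        return self._trie

    def contains(self, term: str) -> bool:
        node = self.trie
        for ch in term:
            if ch not in node:
                return False
            node = node[ch]
        return "$" in node


t = TrieSketch(["term", "terminology"])
print(t.contains("term"))      # True
print(t.contains("terminal"))  # False
```

A real Aho–Corasick automaton additionally adds failure links so that all vocab phrases can be found in a single pass over the document, which is why the library uses pyahocorasick rather than a plain trie.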
- weirdness(*args, **kwargs)