pyate package
Submodules
pyate.combo_basic module
pyate.cvalues module
pyate.term_extraction module
- class pyate.term_extraction.TermExtraction(corpus: Union[str, Iterable[str]], vocab: Sequence[str] = None, patterns=[[{'POS': 'ADJ', 'IS_PUNCT': False}], [{'POS': {'IN': ['ADJ', 'NOUN']}, 'OP': '*', 'IS_PUNCT': False}, {'POS': 'NOUN', 'IS_PUNCT': False}], [{'POS': {'IN': ['ADJ', 'NOUN']}, 'OP': '*', 'IS_PUNCT': False}, {'POS': 'NOUN', 'IS_PUNCT': False}, {'POS': 'DET', 'IS_PUNCT': False}, {'POS': {'IN': ['ADJ', 'NOUN']}, 'OP': '*', 'IS_PUNCT': False}, {'POS': 'NOUN', 'IS_PUNCT': False}]], do_parallelize: bool = True, language='en', nlp=None, default_domain=None, default_domain_size: int = None, max_word_length: int = None, dtype: numpy.dtype = None)

  Bases: object
- DEFAULT_GENERAL_DOMAINS = {}
- adj = {'IS_PUNCT': False, 'POS': 'ADJ'}
- basic(*args, **kwargs)
- combo_basic(*args, **kwargs)
- config = {'DEFAULT_GENERAL_DOMAIN_SIZE': 300, 'MAX_WORD_LENGTH': 6, 'dtype': <class 'numpy.int16'>, 'language': 'en', 'spacy_model': 'en_core_web_sm'}
- static configure(new_settings: Dict[str, Any])

  Updates config settings, which include:

  - spacy_model: str = "en_core_web_sm" (the name of the spaCy model to use)
  - language: str = "en" (the default language)
  - MAX_WORD_LENGTH: int = 6 (the maximum number of words for a candidate to be considered a phrase)
  - DEFAULT_GENERAL_DOMAIN_SIZE: int = 300 (the number of sentences to take from the general-domain file)
  - dtype: np.int16 (the data type of the pandas Series used as counters)
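The merge semantics can be sketched as a shallow update of a class-level config dict. This is a simplified, hypothetical illustration (class name and defaults mirror the documented config values, but it is not pyate's actual source):

```python
# Sketch: a class-level config dict updated in place by a static method,
# mirroring the documented behaviour of TermExtraction.configure.
from typing import Any, Dict


class ExtractorSketch:
    # Defaults copied from the documented config attribute.
    config: Dict[str, Any] = {
        "spacy_model": "en_core_web_sm",
        "language": "en",
        "MAX_WORD_LENGTH": 6,
        "DEFAULT_GENERAL_DOMAIN_SIZE": 300,
    }

    @staticmethod
    def configure(new_settings: Dict[str, Any]) -> None:
        # Overwrite only the keys the caller supplies; leave the rest intact.
        ExtractorSketch.config.update(new_settings)


ExtractorSketch.configure({"language": "it", "MAX_WORD_LENGTH": 4})
print(ExtractorSketch.config["language"])         # it
print(ExtractorSketch.config["MAX_WORD_LENGTH"])  # 4
```

Because the config is class-level, a call to configure affects every subsequent extraction, not just one instance.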
- count_terms_from_document(document: str)

  Counts the frequency of each term in the document and returns a defaultdict mapping each phrase in vocab to its number of occurrences.
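The return shape can be illustrated with a minimal phrase counter. This is a naive sketch (pyate itself matches phrases with an Aho-Corasick automaton, per the trie attribute below):

```python
# Sketch of counting vocab phrases in one document, returning a defaultdict
# as count_terms_from_document does. Naive substring search, for illustration.
from collections import defaultdict
from typing import DefaultDict, Sequence


def count_phrases(document: str, vocab: Sequence[str]) -> DefaultDict[str, int]:
    text = document.lower()
    counts: DefaultDict[str, int] = defaultdict(int)
    for phrase in vocab:
        needle = phrase.lower()
        start = 0
        while True:
            idx = text.find(needle, start)
            if idx == -1:
                break
            counts[phrase] += 1
            start = idx + 1  # allow overlapping occurrences
    return counts


doc = "Neural networks and deep neural networks share layers."
print(count_phrases(doc, ["neural networks", "layers"]))
# defaultdict(<class 'int'>, {'neural networks': 2, 'layers': 1})
```

Because the result is a defaultdict, looking up a vocab phrase that never occurred simply yields 0.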
- count_terms_from_documents(seperate: bool = False, verbose: bool = False)

  The main purpose of this class. Counts terms from the documents and returns a pandas Series. If self.corpus is a string, this is identical to count_terms_from_document. If the corpus is an iterable (more specifically, a collections.abc.Iterable) of strings, the same counting is performed for each string in the iterable. If seperate is set to True, it returns an iterable of defaultdicts; otherwise, it returns a single defaultdict with the frequencies summed across all strings.
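The effect of the seperate flag can be sketched as follows. The helper name and counting logic are hypothetical; only the per-document versus merged return shape mirrors the documented behaviour:

```python
# Sketch of per-document vs. merged term counts, mirroring the seperate flag.
from collections import Counter
from typing import Iterable, List, Union


def count_terms(corpus: Union[str, Iterable[str]],
                vocab: List[str],
                separate: bool = False):
    if isinstance(corpus, str):
        corpus = [corpus]  # a single string behaves like a one-document corpus
    per_doc = [
        Counter({term: doc.lower().count(term.lower()) for term in vocab})
        for doc in corpus
    ]
    if separate:
        return per_doc  # one counter per document
    merged = Counter()
    for counts in per_doc:
        merged += counts  # summing counters merges the frequencies
    return merged


docs = ["term extraction is fun", "automatic term extraction"]
print(count_terms(docs, ["term extraction"], separate=True))
print(count_terms(docs, ["term extraction"]))
```

The merged form is what the scoring methods need; the separate form is useful when per-document frequencies matter.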
- cvalues(*args, **kwargs)
- static get_general_domain(language: str = None, size: int = None)

  Returns a pandas Series of text from the general-domain corpus.
- nlps = {}
- noun = {'IS_PUNCT': False, 'POS': 'NOUN'}
- patterns = [[{'POS': 'ADJ', 'IS_PUNCT': False}], [{'POS': {'IN': ['ADJ', 'NOUN']}, 'OP': '*', 'IS_PUNCT': False}, {'POS': 'NOUN', 'IS_PUNCT': False}], [{'POS': {'IN': ['ADJ', 'NOUN']}, 'OP': '*', 'IS_PUNCT': False}, {'POS': 'NOUN', 'IS_PUNCT': False}, {'POS': 'DET', 'IS_PUNCT': False}, {'POS': {'IN': ['ADJ', 'NOUN']}, 'OP': '*', 'IS_PUNCT': False}, {'POS': 'NOUN', 'IS_PUNCT': False}]]
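The second pattern, for example, matches zero or more adjectives or nouns followed by a noun. Its behaviour over pre-tagged tokens can be simulated with a regular expression over POS codes. This is a rough sketch: pyate applies these patterns with spaCy's Matcher, which also reports overlapping sub-matches, while this greedy version keeps only the longest span:

```python
# Sketch: simulate the pattern [{'POS': {'IN': ['ADJ', 'NOUN']}, 'OP': '*'},
# {'POS': 'NOUN'}] as the regex (ADJ|NOUN)* NOUN over one-character POS codes.
import re
from typing import List, Tuple


def match_noun_phrases(tagged: List[Tuple[str, str]]) -> List[str]:
    # Encode each token's coarse POS as one character: A=ADJ, N=NOUN, O=other.
    code = "".join(
        "A" if pos == "ADJ" else "N" if pos == "NOUN" else "O"
        for _, pos in tagged
    )
    spans = []
    for m in re.finditer(r"[AN]*N", code):
        spans.append(" ".join(tok for tok, _ in tagged[m.start():m.end()]))
    return spans


tokens = [("deep", "ADJ"), ("neural", "ADJ"), ("network", "NOUN"),
          ("learns", "VERB"), ("features", "NOUN")]
print(match_noun_phrases(tokens))  # ['deep neural network', 'features']
```

The third pattern extends this shape with a determiner in the middle, capturing candidates such as "state of the art" style noun-det-noun constructions.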
- prep = {'IS_PUNCT': False, 'POS': 'DET'}
- static set_language(language: str, model_name: str = None)

  For changing the language. Currently, the DEFAULT_GENERAL_DOMAIN data is available in English and Italian only. If you have a good dataset in another language, please open an issue on GitHub.
- term_extractor(*args, **kwargs)
- trie

  Returns an automaton built with the Aho–Corasick algorithm via the pyahocorasick library (https://pypi.org/project/pyahocorasick/). This method builds the automaton the first time and caches it for future use.
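The build-once-then-cache behaviour can be sketched with a plain dict-based trie as a simplified stand-in for the pyahocorasick automaton (class and method names here are hypothetical):

```python
# Sketch: lazily built, cached character trie over the vocab, standing in
# for the cached Aho-Corasick automaton behind TermExtraction.trie.
from typing import Dict, List, Optional


class TrieSketch:
    def __init__(self, vocab: List[str]):
        self.vocab = vocab
        self._trie: Optional[Dict] = None  # built lazily on first access

    @property
    def trie(self) -> Dict:
        if self._trie is None:
            root: Dict = {}
            for word in self.vocab:
                node = root
                for ch in word:
                    node = node.setdefault(ch, {})
                node["$"] = word  # end-of-term marker
            self._trie = root  # cache for all later accesses
        return self._trie

    def contains(self, term: str) -> bool:
        node = self.trie
        for ch in term:
            if ch not in node:
                return False
            node = node[ch]
        return "$" in node


t = TrieSketch(["term", "terminology"])
print(t.contains("term"))      # True
print(t.contains("terminal"))  # False
```

A real Aho–Corasick automaton additionally adds failure links so that all vocab phrases can be found in a single pass over the document, which is why the library uses pyahocorasick rather than a plain trie.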
- weirdness(*args, **kwargs)