Package miopia :: Package preprocessor :: Module PreProcessor :: Class PreProcessor

Class PreProcessor

                 object --+    
                          |    
PreProcessorI.PreProcessorI --+
                              |
                             PreProcessor

Tools for preprocessing plain text.

Instance Methods
 
_prepare_regexps(self)
 
_convert_numbers(self, text)
Removes digit-grouping separators and adds spaces around currency symbols and codes
 
_is_number(self, text)
 
__init__(self, composite_words={}, abbreviations={}, lang='es')
Constructor
 
_get_composite_words_patterns(self, dict_composite_words)
 
_get_abbreviations_patterns(self, dict_abbreviations)
 
_get_special_abbreviations_patterns(self, dict_abbreviations)
 
_format_punkt(self, token)
Returns: The token with its punctuation separated if it is not a number; otherwise the token unchanged
 
_format_composite_words(self, line)
Returns: A line where composite words are joined as one token
 
_format_upper_abbreviations(self, line, a, abbr)
 
_format_abbreviations(self, line)
 
preprocess(self, text)
Returns: The preprocessed string

Inherited from object: __delattr__, __format__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __sizeof__, __str__, __subclasshook__

Class Variables
  RE_CURRENCY_SYMBOL = '[\\$\xe2\x82\xac\xc2\xa3]'
  RE_CURRENCY_CODE = '[A-Z]{3}'
  decimal_mark = '\\.'
  digit_grouping = ','
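
As a rough illustration of how these class variables could combine in a method such as _convert_numbers, the sketch below removes the digit-grouping character between digits and inserts a space between a currency symbol or code and an adjacent number. The function name convert_numbers and the regular expressions built here are assumptions made for illustration, not the library's implementation. The currency symbols are written as Unicode characters (assuming Python 3 text strings) rather than the UTF-8 byte escapes listed above.

```python
import re

# Values mirror the documented class variables; the symbol class uses
# Unicode characters instead of UTF-8 byte escapes (assumption).
RE_CURRENCY_SYMBOL = r"[\$€£]"
RE_CURRENCY_CODE = r"[A-Z]{3}"
DECIMAL_MARK = r"\."
DIGIT_GROUPING = ","


def convert_numbers(text):
    """Hypothetical stand-in for _convert_numbers."""
    # Drop a grouping character only when it sits between a digit and
    # a run of exactly three digits, e.g. "1,500" -> "1500".
    grouping = r"(?<=\d)" + re.escape(DIGIT_GROUPING) + r"(?=\d{3}(?!\d))"
    text = re.sub(grouping, "", text)

    currency = "(?:" + RE_CURRENCY_SYMBOL + "|" + RE_CURRENCY_CODE + ")"
    number = r"\d+(?:" + DECIMAL_MARK + r"\d+)?"

    # Currency before the number: "$1500" -> "$ 1500"
    text = re.sub("(" + currency + ")(" + number + ")", r"\1 \2", text)
    # Currency after the number: "1500€" -> "1500 €"
    text = re.sub("(" + number + ")(" + currency + ")", r"\1 \2", text)
    return text
```

Note that in this sketch the three-letter code pattern matches any three uppercase letters next to a number, so a stricter implementation would probably check the code against a list of known currencies.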
Properties

Inherited from object: __class__

Method Details

__init__(self, composite_words={}, abbreviations={}, lang='es')
(Constructor)

Constructor

Parameters:
  • composite_words - A dictionary of composite words in the format {OriginalWord: JoinedWord}
  • abbreviations - A dictionary of abbreviations in the format {abbreviation: OriginalWord}
Overrides: object.__init__

_format_punkt(self, token)

Parameters:
  • token - A token
Returns:
The token with its punctuation separated if it is not a number; otherwise the token unchanged
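
The behaviour described for _format_punkt can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names is_number and format_punkt, the set of punctuation characters, and the definition of "number" (an optionally signed run of digits with an optional decimal part) are all hypothetical and may differ from what the module actually does.

```python
import re


def is_number(text):
    """Hypothetical stand-in for _is_number."""
    return re.fullmatch(r"[+-]?\d+(?:\.\d+)?", text) is not None


def format_punkt(token):
    """Hypothetical stand-in for _format_punkt."""
    if is_number(token):
        # Numbers such as "3.14" are returned untouched so that the
        # decimal mark is not split off as punctuation.
        return token
    # Surround each punctuation character with spaces, then collapse
    # the extra whitespace this introduces.
    spaced = re.sub(r'([.,;:!?¡¿()"])', r" \1 ", token)
    return " ".join(spaced.split())
```

For example, format_punkt("hola,") yields "hola ," while format_punkt("3.14") returns the token as it was.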

_format_composite_words(self, line)

Parameters:
  • line - A line of a sentence
Returns:
A line where composite words are joined as one token
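
One way the joining step could work, given a dictionary in the {OriginalWord: JoinedWord} format accepted by the constructor, is sketched below. The function name format_composite_words, the underscore-joined forms, and the sample Spanish entries are assumptions for illustration only; the module's own patterns (built by _get_composite_words_patterns) may behave differently.

```python
import re


def format_composite_words(line, composite_words):
    """Hypothetical stand-in for _format_composite_words."""
    # Replace longer expressions first so a short key cannot break up
    # a longer expression that contains it.
    for original in sorted(composite_words, key=len, reverse=True):
        joined = composite_words[original]
        # Match the expression only when it is not part of a larger word.
        pattern = r"(?<!\w)" + re.escape(original) + r"(?!\w)"
        # A function replacement avoids backslash handling in `joined`.
        line = re.sub(pattern, lambda match, j=joined: j, line)
    return line


# Hypothetical sample dictionary in the documented format.
composite_words = {
    "sin embargo": "sin_embargo",
    "a pesar de": "a_pesar_de",
}
```

With this dictionary, format_composite_words("llegó tarde , sin embargo ganó", composite_words) would treat "sin embargo" as a single token in the returned line.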

preprocess(self, text)

Parameters:
  • text - A string
Returns:
The preprocessed string
Overrides: PreProcessorI.PreProcessorI.preprocess