core.csv
========

.. py:module:: core.csv

.. autoapi-nested-parse::

   Offers tools to deal with csv (and xls, xlsx) files.

Attributes
----------

.. autoapisummary::

   core.csv._T
   core.csv._RowT
   core.csv.VALID_CSV_DELIMITERS
   core.csv.WHITESPACE
   core.csv.INVALID_XLSX_TITLE
   core.csv.small_chars
   core.csv.large_chars
   core.csv.max_width

Classes
-------

.. autoapisummary::

   core.csv.CSVFile

Functions
---------

.. autoapisummary::

   core.csv.detect_encoding
   core.csv.sniff_dialect
   core.csv.normalize_header
   core.csv.convert_xlsx_to_csv
   core.csv.convert_xls_to_csv
   core.csv.convert_excel_to_csv
   core.csv.character_width
   core.csv.estimate_width
   core.csv.get_keys_from_list_of_dicts
   core.csv.convert_list_of_dicts_to_csv
   core.csv.convert_list_of_dicts_to_xlsx
   core.csv.convert_list_of_list_of_dicts_to_xlsx
   core.csv.normalize_sheet_titles
   core.csv.avoid_duplicate_name
   core.csv.remove_first_word
   core.csv.has_duplicates
   core.csv.list_duplicates_index
   core.csv.parse_header
   core.csv.match_headers

Module Contents
---------------

.. py:data:: _T

.. py:data:: _RowT

.. py:data:: VALID_CSV_DELIMITERS
   :value: ',;\t'

.. py:data:: WHITESPACE

.. py:data:: INVALID_XLSX_TITLE

.. py:data:: small_chars
   :value: 'fijlrt:,;.+i '

.. py:data:: large_chars
   :value: 'GHMWQ_'

.. py:data:: max_width
   :value: 75

.. py:class:: CSVFile(csvfile: IO[bytes], expected_headers: collections.abc.Collection[str] | None = None, dialect: type[csv.Dialect] | csv.Dialect | str | None = None, encoding: str | None = None, rename_duplicate_column_names: bool = False, rowtype: None = None)
              CSVFile(csvfile: IO[bytes], expected_headers: collections.abc.Collection[str] | None = None, dialect: type[csv.Dialect] | csv.Dialect | str | None = None, encoding: str | None = None, rename_duplicate_column_names: bool = False, *, rowtype: _RowType[_RowT])

   Bases: :py:obj:`Generic`\ [\ :py:obj:`_RowT`\ ]

   Provides access to a csv file.

   :param csvfile:
       The csv file to be accessed. Must be an open file (not a path),
       opened in binary mode.
       For example::

           with open(path, 'rb') as f:
               csv = CSVFile(f)

   :param expected_headers:
       The expected headers, if known. Expected headers are headers which
       *must* exist in the CSV file; additional headers may be present.
       If the headers are slightly misspelled, a matching algorithm tries
       to guess the correct header without accidentally matching the
       wrong one. See :func:`match_headers` for more information.

       If no expected_headers are passed, no checks are done, but the
       headers are still available. Header matching is useful if a user
       provides the CSV and it might be wrong. If it is impossible for
       misspellings to occur, the expected headers don't have to be
       specified.

   :param dialect:
       The CSV dialect to expect. By default, the dialect will be guessed
       using Python's heuristic.

   :param encoding:
       The CSV encoding to expect. By default, the encoding will be
       guessed and will be either UTF-8 or CP1252.

   :param rename_duplicate_column_names:
       It is possible to rename duplicate column names to deal with super
       wacky files. If this option is set and a duplicate header is
       found, a suffix is appended to the column name rather than raising
       a DuplicateColumnNamesError.

   :param rowtype:
       An alternative rowtype for the resulting rows. This should be a
       callable that receives a `rownumber` key/value and all the other
       keys/values found in the csv. The keys are normalized and are
       valid Python identifiers usable as attribute names. Defaults to a
       namedtuple created using the found headers.

   Once the csv file is open, the records can be accessed as follows::

       with open(path, 'rb') as f:
           csv = CSVFile(f)

           for line in csv.lines:
               line.my_field  # access the column with the 'my_field' header

   .. py:attribute:: rowtype
      :type: _RowType[_RowT]

   .. py:attribute:: csvfile

   .. py:attribute:: headers

   .. py:method:: as_valid_identifier(value: str) -> str
      :staticmethod:

   .. py:method:: __iter__() -> collections.abc.Iterator[_RowT]

   .. py:property:: lines
      :type: collections.abc.Iterator[_RowT]
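The behaviour described above (decode the bytes, sniff the dialect, normalize the headers, yield one namedtuple per row) can be sketched with the standard library alone. This is an illustrative approximation rather than the actual implementation: ``read_csv`` is a hypothetical helper, and the real class additionally performs header matching, duplicate-column handling and stricter dialect sniffing.

```python
import csv
import io
from collections import namedtuple


def read_csv(blob: bytes):
    """Hypothetical sketch of what CSVFile does; not the real class."""
    # Try UTF-8 first, then fall back to CP1252 (mirroring detect_encoding).
    try:
        text = blob.decode('utf-8')
    except UnicodeDecodeError:
        text = blob.decode('cp1252')

    # Sniff the dialect from the header line, restricted to sane delimiters.
    dialect = csv.Sniffer().sniff(text.splitlines()[0], delimiters=',;\t')
    rows = csv.reader(io.StringIO(text), dialect=dialect)

    # Normalize the headers into valid Python identifiers.
    headers = [h.strip().lower().replace(' ', '_') for h in next(rows)]
    Row = namedtuple('Row', ['rownumber', *headers])

    for number, row in enumerate(rows, start=1):
        yield Row(number, *row)


blob = b'First Name;Last Name\nKarl;Muster\nVreni;Beispiel\n'
lines = list(read_csv(blob))
print(lines[0].first_name)  # -> 'Karl'
```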
.. py:function:: detect_encoding(csvfile: IO[bytes]) -> str

   Since encoding detection is hard to get right (and to work correctly
   every time), we limit ourselves here to UTF-8 or CP1252, whichever
   works first. CP1252 is basically the csv format you get if you use
   Windows and Excel, and it is a superset of ISO-8859-1/LATIN1.

.. py:function:: sniff_dialect(csv: str) -> type[csv.Dialect]

   Takes the given csv string and returns the dialect or raises an error.

   Works just like Python's built-in sniffer, except that it is a bit
   more conservative and doesn't accept just any kind of character as a
   csv delimiter.

.. py:function:: normalize_header(header: str) -> str

   Normalizes a header value to be as uniform as possible. This includes:

   * stripping the whitespace around it
   * lowercasing everything
   * transliterating unicode (e.g. 'ä' becomes 'a')
   * removing duplicate whitespace inside it

.. py:function:: convert_xlsx_to_csv(xlsx: IO[bytes], sheet_name: str | None = None) -> io.BytesIO

   Takes an XLSX file and returns a csv file using the given worksheet
   name or the first worksheet found.

.. py:function:: convert_xls_to_csv(xls: IO[bytes], sheet_name: str | None = None) -> io.BytesIO

   Takes an XLS file and returns a csv file using the given worksheet
   name or the first worksheet found.

.. py:function:: convert_excel_to_csv(file: IO[bytes], sheet_name: str | None = None) -> io.BytesIO

   Takes an XLS/XLSX file and returns a csv file using the given
   worksheet name or the first worksheet found.

.. py:function:: character_width(char: str) -> float

.. py:function:: estimate_width(text: str) -> float

.. py:function:: get_keys_from_list_of_dicts(rows: collections.abc.Iterable[dict[_SupportsRichComparisonT, Any]], key: None = None, reverse: bool = False) -> tuple[_SupportsRichComparisonT, Ellipsis]
                 get_keys_from_list_of_dicts(rows: collections.abc.Iterable[dict[_T, Any]], key: KeyFunc[_T], reverse: bool = False) -> tuple[_T, Ellipsis]

   Returns all keys of a list of dicts in an ordered tuple.
   If the list of dicts is irregular, the keys found in later rows are
   added at the end of the list.

   Note that the order of keys is otherwise defined by the order of the
   keys of the dictionaries. So if ordered dictionaries are used, the
   order is defined. If regular dictionaries are used, the order is
   undefined.

   Alternatively, a key and a reverse flag may be provided which will be
   used to order the fields. If the list of fields is specified, the key
   and the reverse flag are ignored.

.. py:function:: convert_list_of_dicts_to_csv(rows: collections.abc.Iterable[dict[str, Any]], fields: collections.abc.Sequence[str] | None = None, key: KeyFunc[str] | None = None, reverse: bool = False) -> str

   Takes a list of dictionaries and returns a csv.

   If no fields are provided, all fields are included in the order of the
   keys of the first dict. With regular dictionaries this is random. Use
   an ordered dict or provide a list of fields to have a fixed order.

   Alternatively, a key and a reverse flag may be provided which will be
   used to order the fields. If the list of fields is specified, the key
   and the reverse flag are ignored.

   The function returns a string created in memory. Therefore this
   function is limited to small-ish datasets.

.. py:function:: convert_list_of_dicts_to_xlsx(rows: collections.abc.Iterable[dict[str, Any]], fields: collections.abc.Sequence[str] | None = None, key: KeyFunc[str] | None = None, reverse: bool = False) -> bytes

   Takes a list of dictionaries and returns an xlsx.

   This behaves the same way as :func:`convert_list_of_dicts_to_csv`.

.. py:function:: convert_list_of_list_of_dicts_to_xlsx(row_list: collections.abc.Sequence[collections.abc.Iterable[dict[str, Any]]], titles_list: collections.abc.Sequence[str], key_list: collections.abc.Sequence[KeyFunc[str] | None] | None = None, reverse: bool = False) -> bytes

   Like :func:`convert_list_of_dicts_to_xlsx`, but operates on a list
   instead of a single item.
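The field-ordering rules shared by the converters above can be sketched with the standard library's ``csv.DictWriter``. Here ``dicts_to_csv`` is a hypothetical stand-in for illustration, not the module's own implementation:

```python
import csv
import io


def dicts_to_csv(rows, fields=None, key=None, reverse=False):
    """Write a list of dicts to an in-memory csv string (sketch)."""
    rows = list(rows)
    if fields is None:
        # Collect keys in order of first appearance across all rows,
        # so irregular rows contribute their extra keys at the end ...
        fields = list(dict.fromkeys(k for row in rows for k in row))
        # ... then optionally reorder them with key/reverse. If fields
        # were given explicitly, key and reverse would be ignored.
        if key is not None:
            fields.sort(key=key, reverse=reverse)

    output = io.StringIO()
    # restval='' leaves missing cells empty instead of raising.
    writer = csv.DictWriter(output, fieldnames=fields, restval='')
    writer.writeheader()
    writer.writerows(rows)
    return output.getvalue()


csv_text = dicts_to_csv(
    [{'b': 1, 'a': 2}, {'b': 3, 'c': 4}],
    key=lambda field: field,  # sort fields alphabetically
)
print(csv_text.splitlines()[0])  # -> 'a,b,c'
```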
.. py:function:: normalize_sheet_titles(titles: collections.abc.Sequence[str]) -> list[str]

   Ensures the titles of the xlsx worksheets are valid.

.. py:function:: avoid_duplicate_name(titles: collections.abc.Sequence[str], title: str) -> str

   Naive check to see whether the name already exists. If it does, a new
   name is suggested using an incrementer. Duplicates are detected
   case-insensitively.

.. py:function:: remove_first_word(title: str) -> str

   Removes all chars from the beginning up until and including the first
   "-".

.. py:function:: has_duplicates(a_list: collections.abc.Sequence[Any]) -> bool

.. py:function:: list_duplicates_index(a: collections.abc.Sequence[Any]) -> list[int]

   Returns a list of indexes of duplicates in a list.

   For example::

       a = [1, 2, 3, 2, 1, 5, 6, 5, 5, 5]
       list_duplicates_index(a) == [3, 4, 7, 8, 9]

.. py:function:: parse_header(csv: str, dialect: type[csv.Dialect] | csv.Dialect | str | None = None, rename_duplicate_column_names: bool = False) -> list[str]

   Takes the first line of the given csv string and returns the headers.

   Headers are normalized (see :func:`normalize_header`) and expected to
   be unique. The dialect is sniffed if not provided.

   :return: A list of headers in the order of appearance.

.. py:function:: match_headers(headers: collections.abc.Collection[str], expected: collections.abc.Collection[str]) -> list[str]

   Takes a list of normalized headers and matches them up against a list
   of expected headers.

   The headers may differ from the expected headers. This function tries
   to match them up using the Levenshtein distance. It does so somewhat
   carefully by calculating a sane distance using the input.

   The passed headers are expected to have been normalized by
   :func:`normalize_header`, since we usually pass the result of
   :func:`parse_header` to this function.
   For example::

       match_headers(
           headers=['firstname', 'lastname'],
           expected=['last_name', 'first_name']
       )

   Results in::

       ['first_name', 'last_name']

   If no match is possible, a
   :class:`~onegov.core.errors.MissingColumnsError`,
   :class:`~onegov.core.errors.AmbiguousColumnsError` or
   :class:`~onegov.core.errors.DuplicateColumnNamesError` is raised.

   :return: The matched headers in the order of appearance.
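The core idea behind this matching can be sketched as a greedy nearest-match over Levenshtein distances. This is only an illustration: the real ``match_headers`` also applies a sane distance threshold computed from the input and raises the errors listed above for ambiguous or duplicate matches, all of which this sketch omits.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                # deletion
                current[j - 1] + 1,             # insertion
                previous[j - 1] + (ca != cb),   # substitution
            ))
        previous = current
    return previous[-1]


def match(headers, expected):
    """For each found header, pick the closest expected header (sketch)."""
    return [min(expected, key=lambda e: levenshtein(h, e)) for h in headers]


print(match(['firstname', 'lastname'], ['last_name', 'first_name']))
# -> ['first_name', 'last_name']
```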