core.csv

Offers tools to deal with csv (and xls, xlsx) files.

Attributes

_T

_RowT

VALID_CSV_DELIMITERS

WHITESPACE

INVALID_XLSX_TITLE

small_chars

large_chars

max_width

Classes

CSVFile

Provides access to a csv file.

Functions

detect_encoding(→ str)

Since encoding detection is hard to get right (and work correctly

sniff_dialect(→ type[csv.Dialect])

Takes the given csv string and returns the dialect or raises an error.

normalize_header(→ str)

Normalizes a header value to be as uniform as possible.

convert_xlsx_to_csv(→ io.BytesIO)

Takes an XLS file and returns a csv file using the given worksheet

convert_xls_to_csv(→ io.BytesIO)

Takes an XLS file and returns a csv file using the given worksheet

convert_excel_to_csv(→ io.BytesIO)

Takes an XLS/XLSX file and returns a csv file using the given worksheet

character_width(→ float)

estimate_width(→ float)

get_keys_from_list_of_dicts(…)

Returns all keys of a list of dicts in an ordered tuple.

convert_list_of_dicts_to_csv(→ str)

Takes a list of dictionaries and returns a csv.

convert_list_of_dicts_to_xlsx(→ bytes)

Takes a list of dictionaries and returns a xlsx.

convert_list_of_list_of_dicts_to_xlsx(→ bytes)

Like to convert_list_of_dicts_to_xlsx(), but operates on a list

normalize_sheet_titles(→ list[str])

Ensuring the title of the xlsx is valid.

avoid_duplicate_name(→ str)

Naive check to see whether name already exists.

remove_first_word(→ str)

Removes all chars from beginning up until and including the first "-".

has_duplicates(→ bool)

list_duplicates_index(→ list[int])

Returns a list of indexes of duplicates in a list.

parse_header(→ list[str])

Takes the first line of the given csv string and returns the headers.

match_headers(→ list[str])

Takes a list of normalized headers and matches them up against a

Module Contents

core.csv._T[source]
core.csv._RowT[source]
core.csv.VALID_CSV_DELIMITERS = ',;\t'[source]
core.csv.WHITESPACE[source]
core.csv.INVALID_XLSX_TITLE[source]
core.csv.small_chars = 'fijlrt:,;.+i '[source]
core.csv.large_chars = 'GHMWQ_'[source]
core.csv.max_width = 75[source]
class core.csv.CSVFile(csvfile: IO[bytes], expected_headers: Collection[str] | None = None, dialect: type[Dialect] | Dialect | str | None = None, encoding: str | None = None, rename_duplicate_column_names: bool = False, rowtype: None = None)[source]
class core.csv.CSVFile(csvfile: IO[bytes], expected_headers: Collection[str] | None = None, dialect: type[Dialect] | Dialect | str | None = None, encoding: str | None = None, rename_duplicate_column_names: bool = False, *, rowtype: _RowType[_RowT])

Bases: Generic[_RowT]

Provides access to a csv file.

Parameters:
  • csvfile

    The csv file to be accessed. Must be an open file (not a path), opened in binary mode. For example:

    with open(path, 'rb') as f:
        csv = CSVFile(f)
    

  • expected_headers

    The expected headers if known. Expected headers are headers which must exist in the CSV file. There may be additional headers.

    If the headers are slightly misspelled, a matching algorithm tries to guess the correct header, without accidentally matching the wrong headers.

    See match_headers() for more information.

    If the no expected_headers are passed, no checks are done, but the headers are still available. Headers matching is useful if a user provides the CSV and it might be wrong.

    If it is impossible for misspellings to occurr, the expected headers don’t have to be specified.

  • dialect – The CSV dialect to expect. By default, the dialect will be guessed using Python’s heuristic.

  • encoding – The CSV encoding to expect. By default, the encoding will be guessed and will either be UTF-8 or CP1252.

  • rename_duplicate_column_names – It is possible to rename duplicate column names to deal with super wacky files. If this option is set and a duplicate header is found, a suffix is appended to the column name rather than throwing a DuplicateColumnNamesError.

  • rowtype

    An alternative rowtype for the resulting rows. This should be a callable that receives a rownumber key/value and all the other keys/values found in the csv. The keys are normalized and are valid Python identifiers usable as attribute names.

    Defaults to a namedtuple created using the found headers.

Once the csv file is open, the records can be acceessed as follows:

with open(path, 'rb') as f:
    csv = CSVFile(f)

    for line in csv.lines:
        csv.my_field  # access the column with the 'my_field' header
rowtype: _RowType[_RowT][source]
csvfile[source]
headers[source]
static as_valid_identifier(value: str) str[source]
__iter__() Iterator[_RowT][source]
property lines: Iterator[_RowT][source]
core.csv.detect_encoding(csvfile: IO[bytes]) str[source]

Since encoding detection is hard to get right (and work correctly every time), we limit ourselves here to UTF-8 or CP1252, whichever works first. CP1252 is basically the csv format you get if you use windows and excel and it is a superset of ISO-8859-1/LATIN1.

core.csv.sniff_dialect(csv: str) type[csv.Dialect][source]

Takes the given csv string and returns the dialect or raises an error. Works just like Python’s built in sniffer, just that it is a bit more conservative and doesn’t just accept any kind of character as csv delimiter.

core.csv.normalize_header(header: str) str[source]

Normalizes a header value to be as uniform as possible.

This includes:
  • stripping the whitespace around it

  • lowercasing everything

  • transliterating unicode (e.g. ‘ä’ becomes ‘a’)

  • removing duplicate whitespace inside it

core.csv.convert_xlsx_to_csv(xlsx: IO[bytes], sheet_name: str | None = None) io.BytesIO[source]

Takes an XLS file and returns a csv file using the given worksheet name or the first worksheet found.

core.csv.convert_xls_to_csv(xls: IO[bytes], sheet_name: str | None = None) io.BytesIO[source]

Takes an XLS file and returns a csv file using the given worksheet name or the first worksheet found.

core.csv.convert_excel_to_csv(file: IO[bytes], sheet_name: str | None = None) io.BytesIO[source]

Takes an XLS/XLSX file and returns a csv file using the given worksheet name or the first worksheet found.

core.csv.character_width(char: str) float[source]
core.csv.estimate_width(text: str) float[source]
core.csv.get_keys_from_list_of_dicts(rows: Iterable[dict[_SupportsRichComparisonT, Any]], key: None = None, reverse: bool = False) tuple[_SupportsRichComparisonT, Ellipsis][source]
core.csv.get_keys_from_list_of_dicts(rows: Iterable[dict[_T, Any]], key: KeyFunc[_T], reverse: bool = False) tuple[_T, Ellipsis]

Returns all keys of a list of dicts in an ordered tuple.

If the list of dicts is irregular, the keys found in later rows are added at the end of the list.

Note that the order of keys is otherwise defined by the order of the keys of the dictionaries. So if ordered dictionaries are used, the order is defined. If regular dictionaries are used, the order is undefined.

Alternatively, a key and a reverse flag may be provided which will be used to order the fields. If the list of fields is specified, the key and the reverse flag is ignored.

core.csv.convert_list_of_dicts_to_csv(rows: Iterable[dict[str, Any]], fields: Sequence[str] | None = None, key: KeyFunc[str] | None = None, reverse: bool = False) str[source]

Takes a list of dictionaries and returns a csv.

If no fields are provided, all fields are included in the order of the keys of the first dict. With regular dictionaries this is random. Use an ordered dict or provide a list of fields to have a fixed order.

Alternatively, a key and a reverse flag may be provided which will be used to order the fields. If the list of fields is specified, the key and the reverse flag is ignored.

The function returns a string created in memory. Therefore this function is limited to small-ish datasets.

core.csv.convert_list_of_dicts_to_xlsx(rows: Iterable[dict[str, Any]], fields: Sequence[str] | None = None, key: KeyFunc[str] | None = None, reverse: bool = False) bytes[source]

Takes a list of dictionaries and returns a xlsx.

This behaves the same way as convert_list_of_dicts_to_csv().

core.csv.convert_list_of_list_of_dicts_to_xlsx(row_list: Sequence[Iterable[dict[str, Any]]], titles_list: Sequence[str], key_list: Sequence[KeyFunc[str] | None] | None = None, reverse: bool = False) bytes[source]

Like to convert_list_of_dicts_to_xlsx(), but operates on a list instead of in a single item.

core.csv.normalize_sheet_titles(titles: Sequence[str]) list[str][source]

Ensuring the title of the xlsx is valid.

core.csv.avoid_duplicate_name(titles: Sequence[str], title: str) str[source]

Naive check to see whether name already exists. If name does exist suggest a name using an incrementer Duplicates are case-insensitive

core.csv.remove_first_word(title: str) str[source]

Removes all chars from beginning up until and including the first “-“.

core.csv.has_duplicates(a_list: Sequence[Any]) bool[source]
core.csv.list_duplicates_index(a: Sequence[Any]) list[int][source]

Returns a list of indexes of duplicates in a list. for example:

a = [1, 2, 3, 2, 1, 5, 6, 5, 5, 5]
list_duplicates_index(a) == [3, 4, 7, 8, 9]
core.csv.parse_header(csv: str, dialect: type[Dialect] | Dialect | str | None = None, rename_duplicate_column_names: bool = False) list[str][source]

Takes the first line of the given csv string and returns the headers.

Headers are normalized (stripped and normalized) and expected to be unique. The dialect is sniffed, if not provided.

Returns:

A list of headers in the order of appearance.

core.csv.match_headers(headers: Collection[str], expected: Collection[str]) list[str][source]

Takes a list of normalized headers and matches them up against a list of expected headers.

The headers may differ from the expected headers. This function tries to match them up using the Levenshtein distance. It does so somewhat carefully by calculating a sane distance using the input.

The passed headers are expected to be have been normalized by normalize_header(), since we usually will pass the result of parse_header() to this function.

For example:

match_headers(
    headers=['firstname', 'lastname'],
    expected=['last_name', 'first_name']
)

Results in:

['first_name', 'last_name']

If no match is possible, an MissingColumnsError, or AmbiguousColumnsError or DuplicateColumnNamesError error is raised.

Returns:

The matched headers in the order of appearance.