🗨️ Q & A

What are the different types of typed DataFrames?

You should generally use two: typeddfs.typed_dfs.TypedDf and typeddfs.matrix_dfs.MatrixDf. There is also a specialized matrix type, typeddfs.matrix_dfs.AffinityMatrixDf. You can construct these easily with typeddfs._entries.TypedDfs.typed(), typeddfs._entries.TypedDfs.matrix(), and typeddfs._entries.TypedDfs.affinity_matrix(). There is a final type, defined to have no typing rules, that can be constructed with typeddfs._entries.TypedDfs.untyped(). You can convert a vanilla Pandas DataFrame to an “untyped” variant via typeddfs._entries.TypedDfs.wrap() to give it the additional methods.

from typeddfs import TypedDfs

MyDf = TypedDfs.typed("MyDf").build()
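
The other entry points work similarly. In this sketch, matrix() is assumed to follow the same builder pattern, and wrap() takes an existing DataFrame:

import pandas as pd
from typeddfs import TypedDfs

MyMatrix = TypedDfs.matrix("MyMatrix").build()     # matrix type (builder pattern assumed)
wrapped = TypedDfs.wrap(pd.DataFrame({"x": [1]}))  # plain DataFrame -> untyped variant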

What is the hierarchy of DataFrames?

It’s confusing. In general, you won’t need to know the difference.

typeddfs.typed_dfs.TypedDf and typeddfs.matrix_dfs.MatrixDf inherit from typeddfs.base_dfs.BaseDf, which inherits from typeddfs.abs_dfs.AbsDf, which inherits from typeddfs._core_dfs.CoreDf. (Technically, CoreDf inherits from typeddfs._pretty_dfs.PrettyDf.) Each level of the hierarchy adds functionality on top of the one below it.
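
If you are curious, you can inspect the chain yourself by walking the method resolution order of a generated class:

from typeddfs import TypedDfs

MyDf = TypedDfs.typed("MyDf").build()
# Print the inheritance chain, from MyDf up through the typeddfs base classes:
for cls in MyDf.__mro__:
    print(f"{cls.__module__}.{cls.__name__}")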

What is the difference between __init__, convert, and of?

These three methods in typeddfs.typed_dfs.TypedDf (and its superclasses) behave differently. typeddfs.typed_dfs.TypedDf.__init__() does NOT attempt to reorganize or validate your DataFrame, while typeddfs.typed_dfs.TypedDf.convert() and typeddfs.typed_dfs.TypedDf.of() do. of is simply more flexible than convert: convert only accepts a DataFrame, while of will take anything that DataFrame.__init__ will.
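
A minimal sketch of the distinction (the column name valid is just an example):

import pandas as pd
from typeddfs import TypedDfs

MyDf = TypedDfs.typed("MyDf").require("valid").build()

df1 = MyDf(pd.DataFrame({"wrong": [1]}))          # __init__: no validation, so no error here
df2 = MyDf.convert(pd.DataFrame({"valid": [1]}))  # convert: validates; accepts only DataFrames
df3 = MyDf.of({"valid": [1]})                     # of: validates; accepts anything DataFrame.__init__ does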

When do typed DFs “detype” during chained invocations?

Most DataFrame-level methods that ordinarily return a DataFrame try to keep the same type. This includes typeddfs.abs_dfs.AbsDf.reindex(), typeddfs.abs_dfs.AbsDf.drop_duplicates(), typeddfs.abs_dfs.AbsDf.sort_values(), and typeddfs.abs_dfs.AbsDf.set_index(). This allows for easy chained invocation, but note that the returned DataFrame might no longer conform to your requirements. Call typeddfs.abs_dfs.AbsDf.retype() at the end to reorganize and verify.

from typeddfs import TypedDfs

MyDf = TypedDfs.typed("MyDf").require("valid").build()
my_df = MyDf.read_csv("x.csv")
my_df_2 = my_df.drop_duplicates().rename_cols(valid="ok")
print(type(my_df_2))  # MyDf
# but this fails!
my_df_3 = my_df.drop_duplicates().rename_cols(valid="ok").retype()
# MissingColumnError "valid"

You can call typeddfs.abs_dfs.AbsDf.detype() to remove any typing rules and typeddfs.abs_dfs.AbsDf.vanilla() if you need a plain DataFrame, though this should rarely be needed.
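
For example:

from typeddfs import TypedDfs

MyDf = TypedDfs.typed("MyDf").require("valid").build()
df = MyDf.of({"valid": [1, 2]})
untyped = df.detype()  # same data and extra methods, but typing rules removed
plain = df.vanilla()   # a plain pandas.DataFrame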

How does one get the typing info?

Call typeddfs.base_dfs.BaseDf.get_typing():

from typeddfs import TypedDfs

MyDf = TypedDfs.typed("MyDf").require("valid").build()
MyDf.get_typing().required_columns  # ["valid"]

How are TOML documents read and written?

These are limited to a single array of tables (AOT). The AOT is named row by default (set with aot=). On read, you can pass aot=None to have it use the unique outermost key.
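
A minimal sketch; the to_toml/read_toml accessor names are assumed here, so check your version’s API:

from typeddfs import TypedDfs

MyDf = TypedDfs.typed("MyDf").build()
df = MyDf.of({"key": ["a", "b"], "value": [1, 2]})
df.to_toml("data.toml", aot="row")            # writes a single [[row]] array of tables
back = MyDf.read_toml("data.toml", aot="row")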

How are INI files read and written?

These require exactly 2 columns after reset_index(). Parsing is purposefully minimal because these formats are flexible. Trailing whitespace and whitespace surrounding = are ignored. Values are not escaped, and keys may not contain =. Line continuation with \ is not allowed. Quotation marks surrounding values are not dropped unless drop_quotes=True is passed. Comments begin with ;, along with # if hash_sign=True is passed.

On read, section names are prepended to the keys. For example, the key below will be read as section.key:

[section]
key = value

On write, the inverse happens.
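
A minimal sketch; the read_ini/to_ini accessor names are assumed here, so check your version’s API:

from typeddfs import TypedDfs

KvDf = TypedDfs.typed("KvDf").build()
df = KvDf.read_ini("config.ini", hash_sign=True)  # keys come back as "section.key"
df.to_ini("copy.ini")                             # keys are split back into [section] headers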

What about .properties?

These are similar to INI files. Only hash signs are allowed for comments, and reserved chars are escaped in keys. This includes \\, \=, and \:. These are not escaped in values.
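
For illustration, a hypothetical file might look like this; the = in the key is escaped, while the one in the value is not:

# only hash signs start comments
my\=key = a=b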

What is “flex-width format”?

This is a format that shows up a lot in the wild but doesn’t seem to have a name. It’s just a text format like TSV or CSV, except that the columns are padded so that they line up when viewed in a fixed-width font. The extra whitespace is ignored on read, and on write the columns are made to line up neatly. These files are easy to view. By default, the delimiter is three vertical bars (|||).
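
For illustration, a file written this way might look like:

id ||| name  ||| score
1  ||| alpha ||| 0.5
2  ||| beta  ||| 0.25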

When are read and write guaranteed to be inverses?

In principle, this invariant holds when you call .strict() to disallow additional columns and specify dtype= in all calls to .require and .reserve. In practice, this might break down for certain combinations of DataFrame structure, dtypes, and serialization format. It seems pretty solid for Feather, Parquet, and CSV/TSV-like variants, especially if the dtypes are limited to bools, real values, int values, and strings. There may be corner cases for XML, TOML, INI, Excel, OpenDocument, and HDF5, as well as for categorical and miscellaneous object dtypes.
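
For example, a type built like this (the column names are only illustrative) should round-trip cleanly through Feather, Parquet, or CSV:

from typeddfs import TypedDfs

MyDf = (
    TypedDfs.typed("MyDf")
    .require("id", dtype=int)    # dtype= pins the column type on read
    .reserve("name", dtype=str)  # optional column, also typed
    .strict()                    # disallow any other columns
    .build()
)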

How do I include another filename suffix?

Use .suffix() to register a suffix or remap it to another format.

from typeddfs import TypedDfs, FileFormat

MyDf = TypedDfs.typed("MyDf").suffix(tabbed="tsv").build()
# or:
MyDf = TypedDfs.typed("MyDf").suffix(**{".tabbed": FileFormat.tsv}).build()
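
Once registered, the suffix should be recognized by the generic file I/O methods:

df = MyDf()
df.write_file("data.tabbed")           # recognized and written as TSV
back = MyDf.read_file("data.tabbed")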

How do the checksums work?

There are simple convenience flags for writing sha1sum-like hash files alongside your data files on write, and for verifying them on read.

from pathlib import Path
from typeddfs import TypedDfs

MyDf = TypedDfs.typed("MyDf").build()
df = MyDf()
df.write_file("here.csv", file_hash=True)
# a hex-encoded hash and filename
Path("here.csv.sha256").read_text(encoding="utf8")
MyDf.read_file("here.csv", file_hash=True)  # verifies that it matches

You can change the hash algorithm with .hash(). The second variant is dir_hash, which maintains a single hash file for the whole directory, with one entry per file:

from pathlib import Path
from typeddfs import TypedDfs, Checksums

MyDf = TypedDfs.typed("MyDf").build()
df = MyDf()
path = Path("dir", "here.csv")
df.write_file(path, dir_hash=True, mkdirs=True)
# potentially many hex-encoded hashes and filenames; always appended to
MyDf.read_file(path, dir_hash=True)  # verifies that it matches
# read the per-directory hash file
sums = Checksums.parse_hash_file_resolved(Path("dir", "dir.sha256"))