cldfbench Datasets

While most of cldfbench’s functionality is invoked from the command line via cldfbench subcommands, it is largely implemented in the cldfbench.Dataset class and in classes derived from it for specific datasets.

class cldfbench.dataset.Dataset[source]

A cldfbench dataset ties together

  • raw data, to be used as source for the

  • cldf data, which is created using config data from

  • etc, the directory holding additional configuration and reference data.

To use the cldfbench infrastructure, one should sub-class Dataset.

cldfbench supports the following workflow:

  • a download command populates a Dataset’s raw directory,

  • a makecldf command (re)creates the CLDF dataset in cldf.

The following class attributes are supposed to be overwritten by subclasses:

Variables:
  • dir – pathlib.Path pointing to the root directory of the dataset.

  • id – A str identifier for the dataset. No assumption about uniqueness properties of this identifier is made.

  • metadata_cls – Subclass of Metadata (or Metadata if not overwritten)

repo[source]

The git repository cloned to the dataset’s directory (or None).

Metadata

class cldfbench.metadata.Metadata(id=None, title=None, description=None, license=None, url=None, citation=None)[source]

Dataset metadata is used as follows:

  • it is (partly) elicited when creating a new dataset directory …

  • … and subsequently written to the directory …

  • … where it may be edited (“by hand”) …

  • … and from where it is read when initializing a Dataset object.

To add custom metadata fields for a dataset,

  • inherit from Metadata,

  • add more attr.ib s,

  • register the subclass with the dataset by assigning it to cldfbench.Dataset.metadata_cls.

cldfbench Dataset vs CLDF Dataset

A cldfbench Dataset wraps “raw” source data, conversion code and generated CLDF data into a package. It’s possible for one cldfbench Dataset to create more than one CLDF Dataset. Access to the CLDF Datasets maintained in a cldfbench Dataset is provided as follows:

Dataset.cldf_specs()[source]

A Dataset must declare all CLDF datasets that are derived from it.

Return type:

typing.Union[cldfbench.cldf.CLDFSpec, typing.Dict[str, cldfbench.cldf.CLDFSpec]]

Returns:

A single CLDFSpec instance, or a dict, mapping names to CLDFSpec instances, where the name will be used by cldf_reader/cldf_writer to look up the spec.

Dataset.cldf_specs_dict

Turn cldf_specs() into a dict for simpler lookup.

Returns:

dict mapping lookup keys to CLDFSpec instances.

Dataset.cldf_writer(args, cldf_spec=None, clean=True)[source]
Parameters:
  • args (argparse.Namespace) – Namespace passed in when initializing the CLDFWriter instance.

  • cldf_spec (typing.Union[str, cldfbench.cldf.CLDFSpec, None]) – Key of the relevant CLDFSpec in Dataset.cldf_specs

  • clean (bool) – Flag signaling whether to clean the CLDF directory before writing. Note that False must be passed for subsequent calls to cldf_writer if the spec re-uses a directory.

Return type:

cldfbench.cldf.CLDFWriter

Returns:

a cldf_spec.writer_cls instance, for write-access to CLDF data. This method should be used in a with-statement, and will then return a CLDFWriter with an empty working directory.

Dataset.cldf_reader(cldf_spec=None)[source]
Parameters:

cldf_spec (typing.Union[str, cldfbench.cldf.CLDFSpec, None]) – Key of the relevant CLDFSpec in Dataset.cldf_specs.

Return type:

pycldf.dataset.Dataset

Returns:

a pycldf.Dataset instance, for read-access to the CLDF data.

Configuring CLDF writing

class cldfbench.CLDFSpec(dir, module='Generic', default_metadata_path=None, metadata_fname=None, data_fnames=_Nothing.NOTHING, writer_cls=<class 'cldfbench.cldf.CLDFWriter'>, zipped=_Nothing.NOTHING)[source]

Basic specification to initialize a CLDF Dataset.

Variables:
  • dir – A directory where the CLDF data is located.

  • module – pycldf.Dataset subclass or name of a CLDF module

  • default_metadata_path – Path to the source file for the default metadata for a dataset.

  • metadata_fname – Filename to be used for the actual copy of the metadata.

  • data_fnames – A dict mapping component names to custom csv file names (which may be important if multiple different CLDF datasets are created in the same directory).

  • writer_cls – CLDFWriter subclass to use for writing the data.

  • zipped – An iterable listing component names or csv file names for which the corresponding tables should be zipped.

class cldfbench.CLDFWriter(cldf_spec=None, args=None, dataset=None, clean=True)[source]

An object mediating the writing of data as a proper CLDF dataset.

Implements a context manager which upon exiting will write all objects acquired within the context to disk.

Variables:
  • cldf_spec – CLDFSpec instance, configuring the CLDF dataset written by the writer.

  • objects – dict of lists to collect the data items. Will be passed as kwargs to pycldf.Dataset.write.

Usage:

>>> with CLDFWriter(cldf_spec) as writer:
...     writer.objects['ValueTable'].append(...)
property cldf: Dataset

The pycldf.Dataset used to write the data.

Raises:

AttributeError – If accessed outside of the context managed by this writer.

Accessing data

The three “data” directories can be accessed as cldfbench.DataDir instances:

Dataset.raw_dir[source]

Directory where cldfbench expects the raw or source data.

Dataset.etc_dir[source]

Directory where cldfbench expects additional configuration or metadata.

Dataset.cldf_dir[source]

Directory where CLDF data generated from the Dataset will be stored (unless specified differently by a CLDFSpec).

class cldfbench.datadir.DataDir(*args, **kwargs)[source]

A pathlib.Path augmented with functionality to read common data formats.

read(fname, aname=None, normalize=None, suffix=None, encoding='utf8')[source]

Read text data from a file.

Parameters:
  • fname (typing.Union[str, pathlib.Path]) – Name of a file in DataDir or any pathlib.Path.

  • aname (str) – “file in archive” name, if a file from a zip archive is to be read.

  • suffix (str) – If None, suffix will be inferred from the path to be read. Otherwise it can be used to force reading compressed content by passing .gz or .zip.

  • normalize (str) – Any normalization form understood by unicodedata.normalize.

  • encoding (str) –

Return type:

str

write(fname, text, encoding='utf8')[source]

Write text data to a file.

Parameters:
  • fname (typing.Union[str, pathlib.Path]) – Name of a file in DataDir or any pathlib.Path.

  • text (str) –

read_csv(fname, normalize=None, **kw)[source]

Read CSV data from a file.

Parameters:

fname (typing.Union[str, pathlib.Path]) –

Return type:

typing.List[typing.Union[dict, list]]

write_csv(fname, rows, **kw)[source]

Write CSV data to a file.

Parameters:
  • fname (typing.Union[str, pathlib.Path]) –

  • rows (typing.Iterable[typing.List[str]]) –

read_xml(fname, wrap=True)[source]

Read and parse XML from a file.

Parameters:

fname (typing.Union[str, pathlib.Path]) –

Return type:

xml.etree.ElementTree.Element

ods2csv(fname, outdir=None)[source]

Dump the data from an OpenDocument Spreadsheet (suffix .ODS) file to CSV.

Note

Requires cldfbench to be installed with extra “odf”.

Parameters:
  • fname (typing.Union[str, pathlib.Path]) –

  • outdir (typing.Optional[pathlib.Path]) –

Return type:

typing.Dict[str, pathlib.Path]

xls2csv(fname, outdir=None)[source]

Dump the data from an Excel XLS file to CSV.

Note

Requires cldfbench to be installed with extra “excel”.

Parameters:
  • fname (typing.Union[str, pathlib.Path]) –

  • outdir (typing.Optional[pathlib.Path]) –

Return type:

typing.Dict[str, pathlib.Path]

xlsx2csv(fname, outdir=None)[source]

Dump the data from an Excel XLSX file to CSV.

Note

Requires cldfbench to be installed with extra “excel”.

Parameters:
  • fname (typing.Union[str, pathlib.Path]) –

  • outdir (typing.Optional[pathlib.Path]) –

Return type:

typing.Dict[str, pathlib.Path]

temp_download(url, fname, log=None)[source]

Context manager to use when downloaded data needs to be manipulated before storage (e.g. to anonymize it).

Usage:

with ds.raw_dir.temp_download('http://example.org/data.txt', 'data.txt') as p:
    ds.raw_dir.write('data.txt', p.read_text(encoding='utf8').split('##')[0])
Parameters:
  • url (str) –

  • fname (typing.Union[str, pathlib.Path]) –

Return type:

pathlib.Path

download(url, fname, log=None, skip_if_exists=False)[source]

Download data from a URL to the directory.

Parameters:
  • url (str) –

  • fname (typing.Union[str, pathlib.Path]) –

download_and_unpack(url, *paths, **kw)[source]

Download a zipfile and immediately unpack selected content.

Parameters:
  • url (str) – URL from where to download the archive.

  • paths (str) – Path names to be compared to ZipInfo.filename.

  • kw

Curation workflow

Workflow commands are implemented with two methods for each command:

  • cmd_<command>: The implementation of the command, typically overwritten by datasets.

  • _cmd_<command>: An (optional) wrapper providing setup and teardown functionality, calling cmd_<command> in between.

Workflow commands must accept an argparse.Namespace as sole positional argument.

Dataset.cmd_download(args)[source]

Implementations of this method should populate the dataset’s raw_dir with the source data.

Parameters:

args (argparse.Namespace) –

Dataset.cmd_makecldf(args)[source]

Implementations of this method should write the CLDF data curated by the dataset.

Parameters:

args (argparse.Namespace) – An argparse.Namespace including the attribute writer: the CLDFWriter instance to which the CLDF data should be added.

Dataset.cmd_readme(args)[source]

Implementations of this method should create the content for the dataset’s README.md and return it as markdown formatted string.

Parameters:

args (argparse.Namespace) –

Return type:

str

Dataset.update_submodules()[source]

Convenience method to be used in a Dataset’s cmd_download to update raw data curated as git submodules.

Dataset discovery

cldfbench Datasets may be packaged as installable Python packages. In this case they may advertise an entry point (https://packaging.python.org/specifications/entry-points/) pointing to their cldfbench.Dataset subclass. Such entry points may be used to discover datasets.
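A sketch of such a registration in a package’s setup.py (package, module and dataset names are hypothetical):

```python
from setuptools import setup

setup(
    name='cldfbench_mydataset',  # hypothetical package name
    py_modules=['cldfbench_mydataset'],
    entry_points={
        # Advertise the Dataset subclass under the 'cldfbench.dataset'
        # entry point, so cldfbench can discover it:
        'cldfbench.dataset': ['mydataset=cldfbench_mydataset:MyDataset'],
    },
)
```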

cldfbench.dataset.iter_datasets(ep='cldfbench.dataset')[source]

Yields Dataset instances registered for the specified entry point.

Parameters:

ep (str) – Name of the entry point.

Return type:

typing.Generator[cldfbench.dataset.Dataset, None, None]

cldfbench.dataset.get_dataset(spec, ep='cldfbench.dataset')[source]

Get an initialised Dataset instance.

Parameters:

spec – Specification of the dataset, either an ID or a path to a Python module containing a subclass of Dataset.

Return type:

cldfbench.dataset.Dataset

cldfbench.dataset.get_datasets(spec, ep='cldfbench.dataset', glob=False)[source]
Parameters:
  • spec – Either ‘*’ to get all datasets for a specific entry point, or a glob pattern matching dataset modules in the current directory (if glob == True), or a str as accepted by get_dataset().

  • glob (bool) –

Return type:

typing.List[cldfbench.dataset.Dataset]