cldfbench Datasets

While most of cldfbench’s functionality is invoked from the command line via cldfbench subcommands, it is largely implemented in the cldfbench.Dataset class and in classes derived from it for specific datasets.

class cldfbench.dataset.Dataset[source]

A cldfbench dataset ties together

  • raw data, to be used as source for the

  • cldf data, which is created using config data from

  • etc, the directory holding additional configuration and reference data.

To use the cldfbench infrastructure, one should sub-class Dataset.

cldfbench supports the following workflow:

  • a download command populates a Dataset’s raw directory,

  • a makecldf command (re)creates the CLDF dataset in cldf.

The following class attributes are supposed to be overwritten by subclasses:

Variables:
  • dir – pathlib.Path pointing to the root directory of the dataset.

  • id – A str identifier for the dataset. No assumption about uniqueness properties of this identifier is made.

  • metadata_cls – Subclass of Metadata (or Metadata if not overwritten)

repo[source]

The git repository cloned to the dataset’s directory (or None).

Metadata

class cldfbench.metadata.Metadata(id=None, title=None, description=None, license=None, url=None, citation=None)[source]

Dataset metadata is used as follows:

  • it is (partly) elicited when creating a new dataset directory …

  • … and subsequently written to the directory …

  • … where it may be edited (“by hand”) …

  • … and from where it is read when initializing a Dataset object.

To add custom metadata fields for a dataset,

  • inherit from Metadata,

  • add more attr.ib s,

  • register the subclass with the dataset by assigning it to cldfbench.Dataset.metadata_cls.

cldfbench Dataset vs CLDF Dataset

A cldfbench Dataset wraps “raw” source data, conversion code and generated CLDF data into a package. It’s possible for one cldfbench Dataset to create more than one CLDF Dataset. Access to the CLDF Datasets maintained in a cldfbench Dataset is provided as follows:

Dataset.cldf_specs()[source]

A Dataset must declare all CLDF datasets that are derived from it.

Return type:

typing.Union[cldfbench.cldf.CLDFSpec, typing.Dict[str, cldfbench.cldf.CLDFSpec]]

Returns:

A single CLDFSpec instance, or a dict, mapping names to CLDFSpec instances, where the name will be used by cldf_reader/cldf_writer to look up the spec.

Dataset.cldf_specs_dict

Turn cldf_specs() into a dict for simpler lookup.

Returns:

dict mapping lookup keys to CLDFSpec instances.

Dataset.cldf_writer(args, cldf_spec=None, clean=True)[source]
Parameters:
  • args (argparse.Namespace) – Namespace passed in when initializing the CLDFWriter instance.

  • cldf_spec (typing.Union[str, cldfbench.cldf.CLDFSpec, None]) – Key of the relevant CLDFSpec in Dataset.cldf_specs

  • clean (bool) – Flag signaling whether to clean the CLDF directory before writing. Note that False must be passed for subsequent calls to cldf_writer if the spec re-uses a directory.

Return type:

cldfbench.cldf.CLDFWriter

Returns:

a cldf_spec.writer_cls instance, for write-access to CLDF data. This method should be used in a with-statement, and will then return a CLDFWriter with an empty working directory.

Dataset.cldf_reader(cldf_spec=None)[source]
Parameters:

cldf_spec (typing.Union[str, cldfbench.cldf.CLDFSpec, None]) – Key of the relevant CLDFSpec in Dataset.cldf_specs.

Return type:

pycldf.dataset.Dataset

Returns:

a pycldf.Dataset instance, for read-access to the CLDF data.

Configuring CLDF writing

class cldfbench.CLDFSpec(dir, module='Generic', default_metadata_path=None, metadata_fname=None, data_fnames=_Nothing.NOTHING, writer_cls=<class 'cldfbench.cldf.CLDFWriter'>, zipped=_Nothing.NOTHING)[source]

Basic specification to initialize a CLDF Dataset.

Variables:
  • dir – A directory where the CLDF data is located.

  • module – pycldf.Dataset subclass or name of a CLDF module

  • default_metadata_path – Path to the source file for the default metadata for a dataset.

  • metadata_fname – Filename to be used for the actual copy of the metadata.

  • data_fnames – A dict mapping component names to custom csv file names (which may be important if multiple different CLDF datasets are created in the same directory).

  • writer_cls – CLDFWriter subclass to use for writing the data.

  • zipped – An iterable listing component names or csv file names for which the corresponding tables should be zipped.

class cldfbench.CLDFWriter(cldf_spec=None, args=None, dataset=None, clean=True)[source]

An object mediating the writing of data as a proper CLDF dataset.

Implements a context manager which upon exiting will write all objects acquired within the context to disk.

Variables:
  • cldf_spec – CLDFSpec instance, configuring the CLDF dataset written by the writer.

  • objects – dict of lists to collect the data items. Will be passed as kwargs to pycldf.Dataset.write.

Usage:

>>> with CLDFWriter(cldf_spec) as writer:
...     writer.objects['ValueTable'].append(...)
property cldf: Dataset

The pycldf.Dataset used to write the data.

Raises:

AttributeError – If accessed outside of the context managed by this writer.

Accessing data

The three “data” directories can be accessed as cldfbench.DataDir instances:

Dataset.raw_dir[source]

Directory where cldfbench expects the raw or source data.

Dataset.etc_dir[source]

Directory where cldfbench expects additional configuration or metadata.

Dataset.cldf_dir[source]

Directory where CLDF data generated from the Dataset will be stored (unless specified differently by a CLDFSpec).

class cldfbench.datadir.DataDir(*args, **kwargs)[source]

A pathlib.Path augmented with functionality to read common data formats.

read(fname, aname=None, normalize=None, suffix=None, encoding='utf8')[source]

Read text data from a file.

Parameters:
  • fname (typing.Union[str, pathlib.Path]) – Name of a file in DataDir or any pathlib.Path.

  • aname (str) – “file in archive” name, if a file from a zip archive is to be read.

  • suffix (str) – If None, suffix will be inferred from the path to be read. Otherwise it can be used to force reading compressed content by passing .gz or .zip.

  • normalize (str) – Any normalization form understood by unicodedata.normalize.

  • encoding (str) –

Return type:

str

write(fname, text, encoding='utf8')[source]

Write text data to a file.

Parameters:
  • fname (typing.Union[str, pathlib.Path]) – Name of a file in DataDir or any pathlib.Path.

  • text (str) –

read_csv(fname, normalize=None, **kw)[source]

Read CSV data from a file.

Parameters:

fname (typing.Union[str, pathlib.Path]) –

Return type:

typing.List[typing.Union[dict, list]]

write_csv(fname, rows, **kw)[source]

Write CSV data to a file.

Parameters:
  • fname (typing.Union[str, pathlib.Path]) –

  • rows (typing.Iterable[typing.List[str]]) –

read_xml(fname, wrap=True)[source]

Read and parse XML from a file.

Parameters:

fname (typing.Union[str, pathlib.Path]) –

Return type:

xml.etree.ElementTree.Element

ods2csv(fname, outdir=None)[source]

Dump the data from an OpenDocument Spreadsheet (suffix .ODS) file to CSV.

Note

Requires cldfbench to be installed with extra “odf”.

Parameters:
  • fname (typing.Union[str, pathlib.Path]) –

  • outdir (typing.Optional[pathlib.Path]) –

Return type:

typing.Dict[str, pathlib.Path]

xls2csv(fname, outdir=None)[source]

Dump the data from an Excel XLS file to CSV.

Note

Requires cldfbench to be installed with extra “excel”.

Parameters:
  • fname (typing.Union[str, pathlib.Path]) –

  • outdir (typing.Optional[pathlib.Path]) –

Return type:

typing.Dict[str, pathlib.Path]

xlsx2csv(fname, outdir=None)[source]

Dump the data from an Excel XLSX file to CSV.

Note

Requires cldfbench to be installed with extra “excel”.

Parameters:
  • fname (typing.Union[str, pathlib.Path]) –

  • outdir (typing.Optional[pathlib.Path]) –

Return type:

typing.Dict[str, pathlib.Path]

temp_download(url, fname, log=None)[source]

Context manager to use when downloaded data needs to be manipulated before storage (e.g. to anonymize it).

Usage:

with ds.raw_dir.temp_download('http://example.org/data.txt', 'data.txt') as p:
    ds.raw_dir.write('data.txt', p.read_text(encoding='utf8').split('##')[0])
Parameters:
  • url (str) –

  • fname (typing.Union[str, pathlib.Path]) –

Return type:

pathlib.Path

download(url, fname, log=None, skip_if_exists=False)[source]

Download data from a URL to the directory.

Parameters:
  • url (str) –

  • fname (typing.Union[str, pathlib.Path]) –

download_and_unpack(url, *paths, **kw)[source]

Download a zipfile and immediately unpack selected content.

Parameters:
  • url (str) – URL from where to download the archive.

  • paths (str) – Path names to be compared to ZipInfo.filename.

  • kw

Curation workflow

Workflow commands are implemented with two methods for each command:

  • cmd_<command>: The implementation of the command, typically overwritten by datasets.

  • _cmd_<command>: An (optional) wrapper providing setup and teardown functionality, calling cmd_<command> in between.

Workflow commands must accept an argparse.Namespace as sole positional argument.

Dataset.cmd_download(args)[source]

Implementations of this method should populate the dataset’s raw_dir with the source data.

Parameters:

args (argparse.Namespace) –

Dataset.cmd_makecldf(args)[source]

Implementations of this method should write the CLDF data curated by the dataset.

Parameters:

args (argparse.Namespace) – An argparse.Namespace including the attribute writer: the CLDFWriter instance to which the CLDF data should be added.

Dataset.cmd_readme(args)[source]

Implementations of this method should create the content for the dataset’s README.md and return it as markdown formatted string.

Parameters:

args (argparse.Namespace) –

Return type:

str

Dataset.update_submodules()[source]

Convenience method to be used in a Dataset’s cmd_download to update raw data curated as git submodules.

Dataset discovery

cldfbench Datasets may be packaged as installable Python packages. In this case they may advertise an entry point (https://packaging.python.org/specifications/entry-points/) pointing to their cldfbench.Dataset subclass. Such entry points may be used to discover datasets.
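A sketch of such a registration in a package’s setup.py (package, module and dataset names are hypothetical):

```python
from setuptools import setup

setup(
    name='cldfbench_mydataset',  # hypothetical package name
    py_modules=['cldfbench_mydataset'],
    entry_points={
        # Advertise the Dataset subclass under the 'cldfbench.dataset'
        # entry point, so cldfbench can discover it:
        'cldfbench.dataset': ['mydataset=cldfbench_mydataset:MyDataset'],
    },
)
```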

cldfbench.dataset.iter_datasets(ep='cldfbench.dataset')[source]

Yields Dataset instances registered for the specified entry point.

Parameters:

ep (str) – Name of the entry point.

Return type:

typing.Generator[cldfbench.dataset.Dataset, None, None]

cldfbench.dataset.get_dataset(spec, ep='cldfbench.dataset')[source]

Get an initialised Dataset instance.

Parameters:

spec – Specification of the dataset, either an ID or a path to a Python module containing a subclass of Dataset.

Return type:

cldfbench.dataset.Dataset

cldfbench.dataset.get_datasets(spec, ep='cldfbench.dataset', glob=False)[source]
Parameters:
  • spec – Either ‘*’ to get all datasets for a specific entry point, or a glob pattern matching dataset modules in the current directory (if glob == True), or a str as accepted by get_dataset().

  • glob (bool) –

Return type:

typing.List[cldfbench.dataset.Dataset]