cldfbench Datasets
While most of cldfbench’s functionality is invoked from the command line via cldfbench subcommands, most of it is implemented in the cldfbench.Dataset class and in classes derived from it for specific datasets.
- class cldfbench.dataset.Dataset[source]
A cldfbench dataset ties together the raw source data (in raw), the CLDF data derived from it (in cldf), and configuration data (in etc).
To use the cldfbench infrastructure, one should sub-class Dataset.
cldfbench supports the following workflow:
- a download command populates a Dataset’s raw directory.
- a makecldf command (re)creates the CLDF dataset in cldf.
The following class attributes are supposed to be overwritten by subclasses:
- Variables:
dir – pathlib.Path pointing to the root directory of the dataset.
id – A str identifier for the dataset. No assumption about uniqueness properties of this identifier is made.
metadata_cls – Subclass of Metadata (or Metadata itself if not overwritten)
Metadata
- class cldfbench.metadata.Metadata(id=None, title=None, description=None, license=None, url=None, citation=None)[source]
Dataset metadata is used as follows:
it is (partly) elicited when creating a new dataset directory …
… and subsequently written to the directory …
… where it may be edited (“by hand”) …
… and from where it is read when initializing a Dataset object.
To add custom metadata fields for a dataset:
- inherit from Metadata,
- add more attr.ib attributes,
- register the subclass with the dataset by assigning it to cldfbench.Dataset.metadata_cls.
cldfbench Dataset vs CLDF Dataset
A cldfbench Dataset wraps “raw” source data, conversion code and generated CLDF data into a package. It’s possible for one cldfbench Dataset to create more than one CLDF Dataset. Access to the CLDF Datasets maintained in a cldfbench Dataset is provided as follows:
- Dataset.cldf_specs()[source]
A Dataset must declare all CLDF datasets that are derived from it.
- Return type:
typing.Union[cldfbench.cldf.CLDFSpec, typing.Dict[str, cldfbench.cldf.CLDFSpec]]
- Returns:
A single CLDFSpec instance, or a dict, mapping names to CLDFSpec instances, where the name will be used by cldf_reader/cldf_writer to look up the spec.
- Dataset.cldf_specs_dict
Turn cldf_specs() into a dict for simpler lookup.
- Returns:
dict mapping lookup keys to CLDFSpec instances.
- Dataset.cldf_writer(args, cldf_spec=None, clean=True)[source]
- Parameters:
args (argparse.Namespace) – Namespace passed in when initializing the CLDFWriter instance.
cldf_spec (typing.Union[str, cldfbench.cldf.CLDFSpec, None]) – Key of the relevant CLDFSpec in Dataset.cldf_specs.
clean (bool) – Flag signaling whether to clean the CLDF dir before writing. Note that False must be passed for subsequent calls to cldf_writer in case the spec re-uses a directory.
- Return type:
cldfbench.cldf.CLDFWriter
- Returns:
A cldf_spec.writer_cls instance, for write-access to CLDF data. This method should be used in a with-statement, and will then return a CLDFWriter with an empty working directory.
- Dataset.cldf_reader(cldf_spec=None)[source]
- Parameters:
cldf_spec (typing.Union[str, cldfbench.cldf.CLDFSpec, None]) – Key of the relevant CLDFSpec in Dataset.cldf_specs.
- Return type:
pycldf.dataset.Dataset
- Returns:
a pycldf.Dataset instance, for read-access to the CLDF data.
Configuring CLDF writing
- class cldfbench.CLDFSpec(dir, module='Generic', default_metadata_path=None, metadata_fname=None, data_fnames=_Nothing.NOTHING, writer_cls=<class 'cldfbench.cldf.CLDFWriter'>, zipped=_Nothing.NOTHING)[source]
Basic specification to initialize a CLDF Dataset.
- Variables:
dir – A directory where the CLDF data is located.
module – pycldf.Dataset subclass or name of a CLDF module
default_metadata_path – Path to the source file for the default metadata for a dataset.
metadata_fname – Filename to be used for the actual copy of the metadata.
data_fnames – A dict mapping component names to custom csv file names (which may be important if multiple different CLDF datasets are created in the same directory).
writer_cls – CLDFWriter subclass to use for writing the data.
zipped – An iterable listing component names or csv file names for which the corresponding tables should be zipped.
- class cldfbench.CLDFWriter(cldf_spec=None, args=None, dataset=None, clean=True)[source]
An object mediating the writing of data as a proper CLDF dataset.
Implements a context manager which upon exiting will write all objects acquired within the context to disk.
- Variables:
cldf_spec – CLDFSpec instance, configuring the CLDF dataset written by the writer.
objects – dict of lists to collect the data items. Will be passed as kwargs to pycldf.Dataset.write.
Usage:
>>> with Writer(cldf_spec) as writer:
...     writer.objects['ValueTable'].append(...)
- property cldf: Dataset
The pycldf.Dataset used to write the data.
- Raises:
AttributeError – If accessed outside of the context managed by this writer.
Accessing data
The three “data” directories can be accessed as cldfbench.DataDir instances:
- Dataset.cldf_dir[source]
Directory where CLDF data generated from the Dataset will be stored (unless specified differently by a CLDFSpec).
- class cldfbench.datadir.DataDir(*args, **kwargs)[source]
A pathlib.Path augmented with functionality to read common data formats.
- read(fname, aname=None, normalize=None, suffix=None, encoding='utf8')[source]
Read text data from a file.
- Parameters:
fname (typing.Union[str, pathlib.Path]) – Name of a file in DataDir or any pathlib.Path.
aname (str) – “file in archive” name, if a file from a zip archive is to be read.
suffix (str) – If None, the suffix will be inferred from the path to be read. Otherwise it can be used to force reading compressed content by passing .gz or .zip.
normalize (str) – Any normalization form understood by unicodedata.normalize.
encoding (str) –
- Return type:
str
- write(fname, text, encoding='utf8')[source]
Write text data to a file.
- Parameters:
fname (typing.Union[str, pathlib.Path]) – Name of a file in DataDir or any pathlib.Path.
text (str) –
- read_csv(fname, normalize=None, **kw)[source]
Read CSV data from a file.
- Parameters:
fname (typing.Union[str, pathlib.Path]) –
- Return type:
typing.List[typing.Union[dict, list]]
- write_csv(fname, rows, **kw)[source]
Write CSV data to a file.
- Parameters:
fname (typing.Union[str, pathlib.Path]) –
rows (typing.Iterable[typing.List[str]]) –
- read_xml(fname, wrap=True)[source]
Reads and parses XML from a file.
- Parameters:
fname (typing.Union[str, pathlib.Path]) –
- Return type:
xml.etree.ElementTree.Element
- ods2csv(fname, outdir=None)[source]
Dump the data from an OpenDocument Spreadsheet (suffix .ODS) file to CSV.
Note
Requires cldfbench to be installed with extra “odf”.
- Parameters:
fname (typing.Union[str, pathlib.Path]) –
outdir (typing.Optional[pathlib.Path]) –
- Return type:
typing.Dict[str, pathlib.Path]
- xls2csv(fname, outdir=None)[source]
Dump the data from an Excel XLS file to CSV.
Note
Requires cldfbench to be installed with extra “excel”.
- Parameters:
fname (typing.Union[str, pathlib.Path]) –
outdir (typing.Optional[pathlib.Path]) –
- Return type:
typing.Dict[str, pathlib.Path]
- xlsx2csv(fname, outdir=None)[source]
Dump the data from an Excel XLSX file to CSV.
Note
Requires cldfbench to be installed with extra “excel”.
- Parameters:
fname (typing.Union[str, pathlib.Path]) –
outdir (typing.Optional[pathlib.Path]) –
- Return type:
typing.Dict[str, pathlib.Path]
- temp_download(url, fname, log=None)[source]
Context manager to use when downloaded data needs to be manipulated before storage (e.g. to anonymize it).
Usage:
with ds.raw_dir.temp_download('http://example.org/data.txt', 'data.txt') as p:
    ds.raw_dir.write('data.txt', p.read_text(encoding='utf8').split('##')[0])
- Parameters:
url (str) –
fname (typing.Union[str, pathlib.Path]) –
- Return type:
pathlib.Path
Curation workflow
Workflow commands are implemented with two methods for each command:
cmd_<command>: The implementation of the command, typically overwritten by datasets.
_cmd_<command>: An (optional) wrapper providing setup and teardown functionality, calling cmd_<command> in between.
Workflow commands must accept an argparse.Namespace as sole positional argument.
- Dataset.cmd_download(args)[source]
Implementations of this method should populate the dataset’s raw_dir with the source data.
- Parameters:
args (argparse.Namespace) –
- Dataset.cmd_makecldf(args)[source]
Implementations of this method should write the CLDF data curated by the dataset.
- Parameters:
args (argparse.Namespace) – An argparse.Namespace including attributes:
- writer: CLDFWriter instance
Dataset discovery
cldfbench Datasets may be packaged as installable Python packages. In this case they may advertise an entry point (see https://packaging.python.org/specifications/entry-points/) pointing to their cldfbench.Dataset subclass. Such entry points may be used to discover datasets.
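A dataset packaged for installation might declare the entry point in its setup.py like this (all package and module names are illustrative):

```python
# setup.py of a hypothetical dataset package
from setuptools import setup

setup(
    name='cldfbench_mydataset',
    py_modules=['cldfbench_mydataset'],
    install_requires=['cldfbench'],
    entry_points={
        'cldfbench.dataset': [
            'mydataset=cldfbench_mydataset:MyDataset',
        ],
    },
)
```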
- cldfbench.dataset.iter_datasets(ep='cldfbench.dataset')[source]
Yields Dataset instances registered for the specified entry point.
- Parameters:
ep (str) – Name of the entry point.
- Return type:
typing.Generator[cldfbench.dataset.Dataset, None, None]
- cldfbench.dataset.get_dataset(spec, ep='cldfbench.dataset')[source]
Get an initialised Dataset instance.
- Parameters:
spec – Specification of the dataset, either an ID or a path to a Python module containing a subclass of Dataset.
- Return type:
cldfbench.dataset.Dataset
- cldfbench.dataset.get_datasets(spec, ep='cldfbench.dataset', glob=False)[source]
- Parameters:
spec – Either ‘*’ to get all datasets for a specific entry point, or a glob pattern matching dataset modules in the current directory (if glob == True), or a str as accepted by get_dataset().
glob (bool) –
- Return type:
typing.List[cldfbench.dataset.Dataset]