Technical Documentation : Connectors : Dataset Detection on File Systems
Was this helpful?
Dataset Detection on File Systems
Introduction
The purpose of this document is to detail the rules applied in File System-type Zeenea connector to detect Datasets.
The algorithm checks all the objects from the root path and identifies if they are datasets. Once a folder is identified as a dataset, the algorithm stops its treatment and goes to the next folder.
Folder Containing Only Files
Rule 1
A folder is a dataset when it only contains files and when at least one file has a supported extension.
Supported extensions are: csv, parquet, orc, xml, json, avro
* The "Client" folder is a dataset (rule 1).
* The "Project" folder is considered a unique dataset (rule 1).
* The folder contains only files, and at least one of them has a supported extension.
* The folder contains only files, and at least one of them has a supported extension.
* Zeenea will extract the Schema from the most recent file (in this case, Client20190827.csv).
* Zeenea will extract the schema from the most recent file.
 
* If the files are not homogenous, the documented schema may then change upon the analysis treatment.
Folder with Subfolders
Rule 2
When a folder contains a sub-folder, whose name doesn't follow partition naming conventions, that folder is not a dataset.
Rule 3
A file may be a dataset when it is within a folder containing sub-folders. The file must however possess a supported extension.
* The "Client" folder is not a dataset (rule 2).
* The "Client" folder is not a dataset (rule 2).
* It contains subfolders PP and PM, whose names do not follow dataset naming convention.
* It contains subfolder PP, whose name does not follow dataset naming convention.
* Sub-folders PP and PM are, however, datasets (rule 1). They only contain files.
* The "PP" folder is a dataset (rule 1). It only contains files.
 
Files "Client20190225.csv" and "Client20190226.csv" are datasets (rule 3).
Files with a supported extension.
Folder with Partitions
Rule 4
A folder is a dataset when:
  • All its subfolders' names follow the partition naming convention.
  • At least one of its subfolders would be considered as a dataset had it been isolated.
* The "Client" folder is a dataset (rule 4).
* The "Client" Folder is not a dataset (rule 2).
* "Client" contains subfolders "2019" and "2018", whose names follow the dataset naming convention.
* It contains subfolders "PP" and "2019", whose names do not follow the dataset naming convention.
* "2019" would have been a dataset, had it not been contained within "Client" (rule 3).
* Sub-Folder 2019 is a dataset (rule 4). Its subfolder would be considered a dataset had it been isolated.
* "2019" contains subfolders "05" and "08", whose names follow the dataset naming convention.
* Sub-folder "PP" is a dataset (rule 1). It only contains files.
* "08" would have been a dataset, had it not been contained within "2019"
 
* "08" only contains files, of which at least has a supported extension.
 
Partition Naming Convention
Subfolders are considered partitions if their names match this regular expression:
"(.*=.*)", "[0-9]{8}", "[0-9]{4}", "[0-9]{2}", "0?[1-9]|1[012]", "0?[1-9]|1[0-9]|2[0-9]|3[0-1]"
Last modified date: 11/28/2025