Dataset Detection on File Systems¶
Introduction¶
This document details the rules that File System-type Zeenea connectors apply to detect Datasets.
Starting from the root path, the algorithm examines every object and determines whether it is a dataset. Once a folder is identified as a dataset, the algorithm stops processing that folder and moves on to the next one.
Folder Containing Only Files¶
Rule 1¶
A folder is a dataset when it only contains files and when at least one file has a supported extension.
Supported extensions are: csv, parquet, orc, xml, json, avro
| ![]() | ![]() |
|---|---|
| * The "Client" folder is a dataset (rule 1). | * The "Project" folder is considered a single dataset (rule 1). |
| * The folder contains only files, and at least one of them has a supported extension. | * The folder contains only files, and at least one of them has a supported extension. |
| * Zeenea will extract the schema from the most recent file (in this case, Client20190827.csv). | * Zeenea will extract the schema from the most recent file. |
| * If the files are not homogeneous, the documented schema may change from one analysis to the next. | |
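As an illustration only, Rule 1 and the choice of the schema file could be sketched in Python as follows. The function names are hypothetical, and two points are assumptions that this document does not state: extensions are compared case-insensitively, and "most recent" is taken to mean the latest modification time among the files with a supported extension. The connector's actual implementation may differ.

```python
from pathlib import Path
from typing import Optional

# Extensions listed under Rule 1.
SUPPORTED_EXTENSIONS = {"csv", "parquet", "orc", "xml", "json", "avro"}


def has_supported_extension(path: Path) -> bool:
    # Assumption: the comparison ignores case (".CSV" counts as csv).
    return path.suffix.lower().lstrip(".") in SUPPORTED_EXTENSIONS


def is_rule1_dataset(folder: Path) -> bool:
    """Rule 1: only files, and at least one with a supported extension."""
    entries = list(folder.iterdir())
    if any(entry.is_dir() for entry in entries):
        return False
    return any(has_supported_extension(entry) for entry in entries)


def most_recent_file(folder: Path) -> Optional[Path]:
    """Pick the file whose schema would be documented.

    Assumption: "most recent" means the latest modification time.
    """
    candidates = [p for p in folder.iterdir()
                  if p.is_file() and has_supported_extension(p)]
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)
```

Because only one file is read, a folder whose files have different structures yields whatever schema that single file happens to have.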
Folder with Subfolders¶
Rule 2¶
A folder is not a dataset when it contains at least one subfolder whose name does not follow the partition naming convention.
Rule 3¶
A file can be a dataset when it sits in a folder that also contains subfolders, provided that the file has a supported extension.
| ![]() | ![]() |
|---|---|
| * The "Client" folder is not a dataset (rule 2). | * The "Client" folder is not a dataset (rule 2). |
| * It contains subfolders PP and PM, whose names do not follow the partition naming convention. | * It contains subfolder PP, whose name does not follow the partition naming convention. |
| * Subfolders PP and PM are, however, datasets (rule 1). They only contain files. | * The "PP" folder is a dataset (rule 1). It only contains files. |
| * Files "Client20190225.csv" and "Client20190226.csv" are datasets (rule 3). They have a supported extension. | |
Folder with Partitions¶
Rule 4¶
A folder is a dataset when:
- All its subfolders' names follow the partition naming convention.
- At least one of its subfolders would be considered a dataset if taken in isolation.
| ![]() | ![]() |
|---|---|
| * The "Client" folder is a dataset (rule 4). | * The "Client" folder is not a dataset (rule 2). |
| * "Client" contains subfolders "2019" and "2018", whose names follow the partition naming convention. | * It contains subfolders "PP" and "2019", and the name "PP" does not follow the partition naming convention. |
| * "2019" would have been a dataset, had it not been contained within "Client" (rule 4). | * Subfolder "2019" is a dataset (rule 4). Its subfolder would be considered a dataset had it been isolated. |
| * "2019" contains subfolders "05" and "08", whose names follow the partition naming convention. | * Subfolder "PP" is a dataset (rule 1). It only contains files. |
| * "08" would have been a dataset, had it not been contained within "2019". | |
| * "08" only contains files, of which at least one has a supported extension. | |
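To see how rules 1 to 4 interact, the following Python sketch walks an in-memory folder tree and lists the detected datasets. It is a simplified model, not the connector's code: the tree is a nested dict in which a file maps to None, all function names are hypothetical, partition names are checked by full match against the patterns from the Partition Naming Convention section, and files that sit next to partition subfolders are ignored when rule 4 is evaluated. Those last two points are assumptions.

```python
import re
from typing import List

SUPPORTED = {"csv", "parquet", "orc", "xml", "json", "avro"}
PARTITION_PATTERNS = [r".*=.*", r"[0-9]{8}", r"[0-9]{4}", r"[0-9]{2}",
                      r"0?[1-9]|1[012]", r"0?[1-9]|1[0-9]|2[0-9]|3[0-1]"]


def is_partition(name: str) -> bool:
    # Assumption: the whole folder name must match one pattern.
    return any(re.fullmatch(p, name) for p in PARTITION_PATTERNS)


def supported(name: str) -> bool:
    return "." in name and name.rsplit(".", 1)[1].lower() in SUPPORTED


def is_dataset_folder(folder: dict) -> bool:
    """Would this folder be a dataset if taken in isolation?"""
    subfolders = {n: c for n, c in folder.items() if c is not None}
    files = [n for n, c in folder.items() if c is None]
    if not subfolders:
        return any(supported(f) for f in files)          # rule 1
    if not all(is_partition(n) for n in subfolders):
        return False                                     # rule 2
    return any(is_dataset_folder(c) for c in subfolders.values())  # rule 4


def detect(folder: dict, path: str = "") -> List[str]:
    """Return the paths of all datasets found under `folder`."""
    if is_dataset_folder(folder):
        return [path or "/"]      # stop here: do not descend further
    found = []
    for name in sorted(folder):
        child, child_path = folder[name], f"{path}/{name}"
        if child is None:
            if supported(name):   # rule 3: file beside subfolders
                found.append(child_path)
        else:
            found.extend(detect(child, child_path))
    return found


tree = {"Client": {
    "PP": {"a.csv": None},
    "2019": {"05": {"notes.txt": None}, "08": {"Client20190827.csv": None}},
    "Client20190225.csv": None,
    "readme.txt": None,
}}
print(detect(tree))
```

In this model "Client" is rejected because of "PP" (rule 2), while "2019", "PP", and the CSV file directly under "Client" are reported as datasets.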
Partition Naming Convention¶
Subfolders are considered partitions if their names match one of these regular expressions:
"(.*=.*)", "[0-9]{8}", "[0-9]{4}", "[0-9]{2}", "0?[1-9]|1[012]", "0?[1-9]|1[0-9]|2[0-9]|3[0-1]"

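The patterns above can be tried out with a short Python sketch. The helper name is hypothetical, and the use of `re.fullmatch` (the whole folder name must match) is an assumption, since this document does not say whether the connector anchors the patterns.

```python
import re

# Patterns copied from the list above.
PARTITION_PATTERNS = [
    r"(.*=.*)",                       # key=value, e.g. year=2019
    r"[0-9]{8}",                      # eight digits, e.g. 20190827
    r"[0-9]{4}",                      # four digits, e.g. 2019
    r"[0-9]{2}",                      # two digits, e.g. 08
    r"0?[1-9]|1[012]",                # 1 to 12, optional leading zero
    r"0?[1-9]|1[0-9]|2[0-9]|3[0-1]",  # 1 to 31, optional leading zero
]


def is_partition_name(name: str) -> bool:
    # Assumption: a name qualifies only if one pattern matches it entirely.
    return any(re.fullmatch(p, name) for p in PARTITION_PATTERNS)


for name in ["year=2019", "20190827", "2019", "08", "5", "PP", "123"]:
    print(name, is_partition_name(name))
```

Under full matching, the last two patterns mostly add single-digit names such as "5", because every two-digit name is already covered by `[0-9]{2}`.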




