How to preprocess invalid CSV in a canonical way #1835

alexeyabel · 2022-09-06T19:44:22Z

alexeyabel
Sep 6, 2022

What is the canonical way of fixing a CSV file with illegal syntax and then continuing to work with it? I can't use type: pandas.CSVDataSet for it in the data catalog because parsing it would drop some of the illegal data.

So far I am using kedro.extras.datasets.text.TextDataSet and fixing the raw string of the file in a node. But how should I create the next catalog entry? I tried telling the node to output it into a data entry of type: pandas.CSVDataSet, but I get the error that str does not have a to_csv attribute. Should I call pandas.read_csv() manually in my syntax-fixing method? Or how do I add preprocessing steps to fix the faulty CSV?

noklam · 2022-09-06T19:48:49Z

noklam
Sep 6, 2022
Collaborator

What do you mean by illegal syntax? If it's not a valid csv file then you will just treat it as a normal text file.

In that case, doing pd.read_csv within your node may not be too bad; it is acting as transformation logic (arguably a string2dataframe function) rather than I/O.

1 reply

alexeyabel Sep 7, 2022
Author

> What do you mean by illegal syntax? If it's not a valid csv file then you will just treat it as a normal text file.

Some of the (logical CSV) lines are split across multiple (file) lines, and I need to combine them into one line first. If I just treat the file as a CSV and declare a data entry with CSVDataSet, every part of a split line after the first is assigned to the wrong columns.

> In that case, doing pd.read_csv within your node may not be too bad; it is acting as transformation logic (arguably a string2dataframe function) rather than I/O.

I thought there might be hooks or other built-in functionality to handle invalid CSV files, since this is probably a common occurrence; at least it was in my ML projects. I think I can do it "manually" with TextDataSet, but I assumed the Kedro developers had already accommodated that case, hence my question about a canonical solution.
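
For illustration only, the "manual" route discussed in this thread (load the file as text, repair it in a node, then parse it) might look roughly like the sketch below. It is not a built-in Kedro feature. The function names `count_fields` and `merge_split_rows` and the `joiner` parameter are hypothetical, and the repair rule is an assumption: the first line is taken to be an intact header, and physical lines are glued together until a record has as many fields as the header. A record that is broken inside its last field would not be repaired by this heuristic.

```python
import csv


def count_fields(line: str, delimiter: str = ",") -> int:
    """Count the CSV fields in one line, respecting quoted delimiters."""
    return len(next(csv.reader([line], delimiter=delimiter)))


def merge_split_rows(raw_text: str, delimiter: str = ",", joiner: str = "") -> str:
    """Glue physical lines back into logical CSV records (hypothetical helper).

    Assumption: the first non-blank line is a complete header, so its field
    count is the expected count for every record.
    """
    lines = [ln for ln in raw_text.splitlines() if ln.strip()]
    if not lines:
        return ""

    expected = count_fields(lines[0], delimiter)
    merged = []
    buffer = ""
    for line in lines:
        # Append the current physical line to any pending partial record.
        buffer = line if not buffer else buffer + joiner + line
        if count_fields(buffer, delimiter) >= expected:
            merged.append(buffer)
            buffer = ""

    if buffer:
        # Keep an incomplete trailing record instead of silently dropping it.
        merged.append(buffer)
    return "\n".join(merged) + "\n"


raw = "id,name,comment\n1,Alice,ok\n2,Bo\nb,split row\n3,Carol,fine\n"
print(merge_split_rows(raw))
```

Inside a node that receives the TextDataSet string, the repaired text could then be passed to `pandas.read_csv(io.StringIO(repaired_text))`, so that the node returns a DataFrame and its output can be declared as a pandas.CSVDataSet entry in the catalog.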
