Logging bad rows when importing CSV files into Postgres with pgloader

by Btz   Last Updated October 16, 2018 13:06 PM

I'm creating a pgloader command file to import a set of CSV files supplied by another team into a table in a Postgres database. Loading well-formed files works fine, but our input files might be badly formatted. Ideally I'd like to skip and log all bad rows in my input files; the documentation says:

To enable retrying the data and loading the good parts, use the option on error resume next, which is the default to file based data loads (such as CSV, IXF or DBF).

but I'm not quite able to get it to work.

This is my command file:

LOAD CSV
  FROM
    ALL FILENAMES MATCHING ~/{{{FILE_REGEXP}}}/
    IN DIRECTORY '{{STAGING_PATH}}' 
    WITH ENCODING UTF8
  INTO postgresql://{{PGUSER}}:{{PGPASSWORD}}@{{PGHOST}}/{{PGDATABASE}}
    TARGET TABLE {{TABLE_NAME}}

  WITH
    csv header,
    on error resume next,
    fields terminated by '|',
    truncate
;
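
For reference, this is roughly how I invoke it (the file name is just an example; the {{...}} placeholders are substituted before pgloader reads the file, and /tmp/pgloader is pgloader's default root directory):

# people.load is the command file above, with placeholders already filled in
pgloader --root-dir /tmp/pgloader people.load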

I have a couple of files where I deliberately inserted errors for testing. Here's part of the first:

"id"|"code"|"first_name"|"first_name_1"|"last_name" "jba9ltm3rk58e9bg"|"O1smr83CmvMkqnrV"|"Adam"|"Ant" "wew6tiq5zmkak19y","I4hlQYrcCSlN3KnV"|Barbara|"Barb"|Bell "mtsunn0i0ac2t30o"|"vePl6Nti7q1s9YeI"|"Carl"|"Chip"|"Cook"

And part of the second:

"id"|"code"|"first_name"|"first_name_1"|"last_name" "zcwoyy9e2yiuvan3"|"KNK59muKY5AcfQsd"|"Edward"|""|"Ember" "fvdagry1jbbgtv63","bJnYxUUHo1R3rlYK","Faith","Faith","Flint" "k606tkzte4t8mjv2"|"qskHKqVvwbPRE0xY"|"Gareth"|"George"|"Gatsby"

At the first error, and regardless of whether I use on error resume next or on error stop, pgloader stops processing the current file and moves on to the next one. For example, I never see Carl Cook or Gareth Gatsby in my database.

The documentation says this about logging bad rows:

At the end of a load containing rejected rows, you will find two files in the root-dir location, under a directory named the same as the target database of your setup. The filenames are the target table, and their extensions are .dat for the rejected data and .log for the file containing the full PostgreSQL client side logs about the rejected data.
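
If I read that correctly, I should end up with something like this under the root directory (the database and table names below are placeholders):

/tmp/pgloader/<dbname>/             # directory named after the target database
/tmp/pgloader/<dbname>/<table>.dat  # the rejected rows
/tmp/pgloader/<dbname>/<table>.log  # PostgreSQL client-side logs for those rows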

In my case, pgloader creates a directory named after my target table in the output directory, but it is empty, even though there clearly are bad rows in my input files.

I also tried adding batch rows = 1, hoping pgloader would move on to the next one-row batch after an error, but there was no change in the output.
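
For completeness, the WITH clause with that extra option looked like this:

  WITH
    csv header,
    on error resume next,
    fields terminated by '|',
    batch rows = 1,
    truncate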

I'm probably doing something wrong. What can I check?

Tags: postgresql

