Data import error of tls206_part02.csv

Here you can post your opinions, ask questions and share experiences on the PATSTAT product line. Please always indicate the PATSTAT edition (e.g. 2015 Autumn Edition) and the database (e.g. PATSTAT Online, MySQL, MS SQL Server, ...) you are using.
Post Reply

frankcpp
Posts: 2
Joined: Mon Mar 29, 2021 3:33 pm

Data import error of tls206_part02.csv

Post by frankcpp » Mon Mar 29, 2021 3:44 pm

Dear all,

Environment:
PATSTAT Autumn 2020
MS SQL Server 2019 on Windows 10

I used the build-in scripts (createDatabaseSQLserver.sql and alltablescreate.sql) to create the database and tables. When I imported tls206_part02.csv, I encountered some problems. My command and error log are provided as follows. I will be very grateful for your advice or help.

Table: dbo.tls206_person
Data: tls206_part02.csv

BULK INSERT dbo.tls206_person FROM 'D:\PATSTAT_Autumn_2020\data\tls206_part02.csv' WITH (FORMAT = 'CSV', FIRSTROW = 2, TABLOCK, ERRORFILE = 'F:\Error\tls206_part02.csv');

Msg 4863, Level 16, State 1, Line 85
Bulk load data conversion error (truncation) for row 8092915, column 2 (person_name).
Msg 4863, Level 16, State 1, Line 85
Bulk load data conversion error (truncation) for row 8179139, column 3 (person_name_orig_lg).
Msg 4863, Level 16, State 1, Line 85
Bulk load data conversion error (truncation) for row 8594950, column 3 (person_name_orig_lg).
Msg 4863, Level 16, State 1, Line 85
Bulk load data conversion error (truncation) for row 8684855, column 3 (person_name_orig_lg).
Msg 4863, Level 16, State 1, Line 85
Bulk load data conversion error (truncation) for row 9337528, column 3 (person_name_orig_lg).
Msg 4863, Level 16, State 1, Line 85
Bulk load data conversion error (truncation) for row 10067197, column 3 (person_name_orig_lg).
Msg 4863, Level 16, State 1, Line 85
Bulk load data conversion error (truncation) for row 10452815, column 3 (person_name_orig_lg).
Msg 4863, Level 16, State 1, Line 85
Bulk load data conversion error (truncation) for row 10470125, column 3 (person_name_orig_lg).
Msg 4863, Level 16, State 1, Line 85
Bulk load data conversion error (truncation) for row 10837299, column 2 (person_name).
Msg 4863, Level 16, State 1, Line 85
Bulk load data conversion error (truncation) for row 10862858, column 2 (person_name).
Msg 4863, Level 16, State 1, Line 85
Bulk load data conversion error (truncation) for row 10862902, column 2 (person_name).
Msg 4865, Level 16, State 1, Line 85
Cannot bulk load because the maximum number of errors (10) was exceeded.
Msg 7399, Level 16, State 1, Line 85
The OLE DB provider "BULK" for linked server "(null)" reported an error. The provider did not give any information about the error.
Msg 7330, Level 16, State 2, Line 85
Cannot fetch a row from OLE DB provider "BULK" for linked server "(null)".

Many thanks,
Frank


mkracker
Posts: 121
Joined: Wed Sep 04, 2013 6:17 am
Location: Vienna

Re: Data import error of tls206_part02.csv

Post by mkracker » Tue Mar 30, 2021 9:37 am

Hi Frank,

It is difficult to comment regarding data loading, because it depends on the tools and the configuration used. So I had a look at the data to see what's so special for the data you reported. The first error in your log file refers to row 8092915, column 2 (person_name). I checked this row in the csv file: It has the PERSON_ID = 58108440 and an unusual long person name in Cyrillic. I retrieved this record in my PATSTAT database:

select len(person_name) as NumberOfCharacters, datalength(person_name) as NumberOfBytes, *
from tls206_person
where person_id = 58108440

I see that the PERSON_NAME has 326 characters with 652 bytes. The CSV files are UTF-8 encoded, so Cyrillic characters to my knowledge need 2 bytes, ASCII characters (blanks, quotes, ..) need 1 byte.
I suppose your DB column PERSON_NAME is defined as nvarchar(500), as suggested in the PATSTAT Data Catalog.
So I assume that your data import does not handle UTF-8 encodings correctly.

You reported that you are using MS SQL Sever version 2019 - which I never used. In this version there are new UTF-8 collations, so things may work differently.

Can anybody else with more experience with data loading into MS SQL server 2019 give advise?
-------------------------------------------
Martin Kracker / EPO


frankcpp
Posts: 2
Joined: Mon Mar 29, 2021 3:33 pm

Re: Data import error of tls206_part02.csv

Post by frankcpp » Tue Mar 30, 2021 12:55 pm

Hi Martin,

I have uninstalled MS SQL 2019 and changed it to MS SQL 2017. However, the problems are still the same. I used the built-in scripts to create the database and tables.

The collation defined in createDatabaseSQLserver.sql is: SQL_Latin1_General_CP1_CI_AS. And I didn't change it.

My data import command is very simple: BULK INSERT dbo.tls206_person FROM 'D:\PATSTAT_Autumn_2020\data\tls206_part02.csv' WITH (FORMAT = 'CSV', FIRSTROW = 2, TABLOCK, ERRORFILE = 'F:\Error\tls206_part02.csv');

Could you please tell me which version you are using?

Thanks,
Frank


mkracker
Posts: 121
Joined: Wed Sep 04, 2013 6:17 am
Location: Vienna

Re: Data import error of tls206_part02.csv

Post by mkracker » Thu Apr 01, 2021 7:00 am

Hi Frank,

We work with an MS SQL Server 2017 database, from which we are extracting the CSV files,. So I myself have no hands-on experience in loading CSV files into the database.

Nevertheless, according to the documentation of the BULK INSERT https://docs.microsoft.com/en-us/sql/t- ... rver-ver15 you might need to add something like CODEPAGE = '65001' to your statement.

I hope this helps,
Martin
-------------------------------------------
Martin Kracker / EPO


adnan.gusic
Posts: 3
Joined: Thu May 06, 2021 1:50 pm

Re: Data import error of tls206_part02.csv

Post by adnan.gusic » Mon May 17, 2021 11:23 pm

Hi people,

I encountered an identical error while importing CSV tls206_part02.csv
Have you found any solution in the meantime?

Best regards,
Adnan


EPO / PATSTAT Support
Posts: 262
Joined: Thu Feb 22, 2007 5:33 pm
Contact:

Re: Data import error of tls206_part02.csv

Post by EPO / PATSTAT Support » Tue May 18, 2021 3:27 pm

Hello Adnan,
here is the script that works fine on our MS SQL server to create the table: (from the scripts in the documentation):

Code: Select all

USE [patstat2021a]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[tls206_person](
	[person_id] [int] NOT NULL DEFAULT ('0'),
	[person_name] [nvarchar](500) NOT NULL DEFAULT (''),
	[person_name_orig_lg] [nvarchar](500) NOT NULL DEFAULT (''),
	[person_address] [nvarchar](1000) NOT NULL DEFAULT (''),
	[person_ctry_code] [char](2) NOT NULL DEFAULT (''),
	[nuts] [varchar](5) NOT NULL DEFAULT '',
	[nuts_level] [tinyint] NOT NULL DEFAULT ('9'),
	[doc_std_name_id] [int] NOT NULL DEFAULT ('0'),
	[doc_std_name] [nvarchar](500) NOT NULL DEFAULT (''),
	[psn_id] [int] NOT NULL DEFAULT ('0'),
	[psn_name] [nvarchar](500) NOT NULL DEFAULT (''),
	[psn_level] [tinyint] NOT NULL DEFAULT ('0'),
	[psn_sector] [varchar](50) NOT NULL DEFAULT (''),
	[han_id] [int] NOT NULL DEFAULT ('0'),
	[han_name] [nvarchar](500) NOT NULL DEFAULT (''),
	[han_harmonized] [int] NOT NULL DEFAULT ('0'),
PRIMARY KEY CLUSTERED 
(
	[person_id] ASC
)WITH (IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
And to bulk load data from the second part of the tls206_person file on the above table:

Code: Select all

BULK INSERT tls206_person
FROM '\\172.21.1.151\TempData\tls206_part02\tls206_part02.csv' 
WITH (FIRSTROW = 2,
    FIELDTERMINATOR = ',',
    ROWTERMINATOR='\n',
	CODEPAGE = '65001',
	FIELDQUOTE = '"',
	BATCHSIZE=250000,
	FORMAT = 'CSV',
    MAXERRORS=2);
This took about 2.5 minutes, no errors.
PATSTAT Support Team
EPO - Vienna
patstat @ epo.org


adnan.gusic
Posts: 3
Joined: Thu May 06, 2021 1:50 pm

Re: Data import error of tls206_part02.csv

Post by adnan.gusic » Thu May 20, 2021 10:41 pm

@EPO / PATSAT Support - Thank you very much for your fast reply.

I also use similar scripts for IMPORT, but in this case for the tls206_person table it didn't help me much. Here is what one row looked like in this particular CSV file:

Image

After a deep analysis I was able to solve this issue. There was a problem with Encoding, at the operating system level. The solution is as follows:

For Windows 10 operating systems: If you notice that Encoding or the display of special characters is incorrect (this especially applies to text fields) you must do the following:

Control Panel > Region > Tab Administrative
Hit button "Change system locale".
And check the checkbox labeled "Beta: Use Unicode UTF-8 for worldwide language support".
Hit the ok button. You will be asked to restart the computer.


After this change, the import went smoothly. And CSV records were displayed correctly.


Post Reply