Duplicate Application ID

Here you can post your opinions, ask questions and share experiences on the PATSTAT product line. Please always indicate the PATSTAT edition (e.g. 2015 Autumn Edition) and the database (e.g. PATSTAT Online, MySQL, MS SQL Server, ...) you are using.
Post Reply

AthimetC
Posts: 2
Joined: Tue Feb 16, 2021 5:39 am

Duplicate Application ID

Post by AthimetC » Tue Feb 16, 2021 5:50 am

Hi there,

I have a question about the SQL code below:

Code: Select all

SELECT a.appln_id, a.appln_kind, a.ipr_type, a.granted, a.appln_filing_date, a.nb_citing_docdb_fam,
t.appln_title,
ab.appln_abstract,
c.ipc_class_symbol,
ap.psn_name, ap.psn_id, ap.person_ctry_code, ap.psn_sector,
pub.publn_date
FROM tls201_appln a
JOIN tls203_appln_abstr ab ON a.appln_id = ab.appln_id
JOIN tls202_appln_title t ON a.appln_id = t.appln_id
JOIN tls209_appln_ipc c ON a.appln_id = c.appln_id
JOIN tls211_pat_publn pub ON a.appln_id = pub.appln_id
JOIN tls212_citation ci ON pub.pat_publn_id = ci.pat_publn_id
JOIN tls207_pers_appln ON a.appln_id = tls207_pers_appln.appln_id
JOIN tls206_person ap ON tls207_pers_appln.person_id = ap.person_id
WHERE c.ipc_class_symbol LIKE 'A01G%' -- Change IPC symbol here
-- AND a.appln_filing_date BETWEEN '1980-01-01' AND '2020-12-31'
-- AND a.appln_filing_date BETWEEN '1980-01-01' AND '2012-12-31' 
AND a.appln_filing_date BETWEEN '2013-01-01' AND '2020-12-31'
AND tls207_pers_appln.applt_seq_nr > 0 -- i.e. applicants only
AND ap.psn_sector LIKE 'COMPANY'
ORDER BY a.appln_filing_year, a.appln_id
I found that there are many application ID numbers which are duplicate, is it correct?

Please help,

Best

Athimet


EPO / PATSTAT Support
Posts: 425
Joined: Thu Feb 22, 2007 5:33 pm
Contact:

Re: Duplicate Application ID

Post by EPO / PATSTAT Support » Tue Feb 16, 2021 6:26 pm

Hello Athimet,
from a syntax point of view, your query is correct.
From a logical & data aggregation point of view, there is room for improvement.

There are duplicate records because of the combination of tables you have joined in the FROM clause in your query. You should in principle not join any tables that contain data which you will not use in your SELECT clause or which you don't need in your result list. In your case, you joined tls212_citation but you don't use any of the data in that table, be it for your result list or your conditions in the WHERE clause. I assume you don't need it, and if so you can remove it.

Example: the joining with the tls209_appln_ipc table will result in multiple rows for the same application, each having an IPC starting with 'A01G%'. So that could be 'A01G 1/00', 'A01G 9/00','A01G 27/02', ...
Questoin is if you need in your result all those classification codes. A01G covers AGRICULTURE, maybe it is sufficient that you know that the patents in your result have been classified in that area without needing every A01G classification code in your result as a new row.

SQL will in principle create a new row based on a cartesian product of all the tables and attributes in your SELECT clause. Using a DISTINCT will already remove duplicate rows, but you have to decide what data exactly you need for your research, and then remove what is not needed.

Below is your query but I added a condition to limit the result to all records from 1 single application.

Code: Select all

SELECT distinct  a.appln_id, a.appln_kind, a.ipr_type, a.granted, a.appln_filing_date, a.nb_citing_docdb_fam,
t.appln_title,
ab.appln_abstract,
c.ipc_class_symbol,
ap.psn_name, ap.psn_id, ap.person_ctry_code, ap.psn_sector,
pub.publn_date, nb_applicants
FROM tls201_appln a
JOIN tls203_appln_abstr ab ON a.appln_id = ab.appln_id
JOIN tls202_appln_title t ON a.appln_id = t.appln_id
JOIN tls209_appln_ipc c ON a.appln_id = c.appln_id
JOIN tls211_pat_publn pub ON a.appln_id = pub.appln_id
JOIN tls212_citation ci ON pub.pat_publn_id = ci.pat_publn_id
JOIN tls207_pers_appln ON a.appln_id = tls207_pers_appln.appln_id
JOIN tls206_person ap ON tls207_pers_appln.person_id = ap.person_id
WHERE c.ipc_class_symbol LIKE 'A01G%' -- Change IPC symbol here
-- AND a.appln_filing_date BETWEEN '1980-01-01' AND '2020-12-31'
-- AND a.appln_filing_date BETWEEN '1980-01-01' AND '2012-12-31'
AND a.appln_filing_date BETWEEN '2013-01-01' AND '2020-12-31'
AND tls207_pers_appln.applt_seq_nr > 0 -- i.e. applicants only
AND ap.psn_sector LIKE 'COMPANY'
and a.appln_id = 412911020
ORDER BY a.appln_filing_date, a.appln_id
The result will be 6 different rows - but all for the same single patent. They originate from the 2 applicants and the 3 IPC codes --> 1*2*3 = 6 rows.

Depending on what data you need in your final table, you could adapt the query to get less rows - and less data.
Here is an example where I have moved the IPC table from the FROM clause into the WHERE clause.
This results now in 2 rows, one for each applicant.

Code: Select all

SELECT distinct  a.appln_id, a.appln_kind, a.ipr_type, a.granted, a.appln_filing_date, a.nb_citing_docdb_fam,
t.appln_title,
ab.appln_abstract,
ap.psn_name, ap.psn_id, ap.person_ctry_code, ap.psn_sector,
pub.publn_date, nb_applicants
FROM tls201_appln a
JOIN tls203_appln_abstr ab ON a.appln_id = ab.appln_id
JOIN tls202_appln_title t ON a.appln_id = t.appln_id
JOIN tls209_appln_ipc c ON a.appln_id = c.appln_id
JOIN tls211_pat_publn pub ON a.appln_id = pub.appln_id
JOIN tls207_pers_appln ON a.appln_id = tls207_pers_appln.appln_id
JOIN tls206_person ap ON tls207_pers_appln.person_id = ap.person_id
WHERE 
a.appln_id in (select appln_id from tls209_appln_ipc where ipc_class_symbol LIKE 'A01G%') -- Change IPC symbol here
AND a.appln_filing_date BETWEEN '2013-01-01' AND '2020-12-31'
AND tls207_pers_appln.applt_seq_nr > 0 -- i.e. applicants only
AND ap.psn_sector LIKE 'COMPANY'
and a.appln_id = 412911020
ORDER BY a.appln_filing_date, a.appln_id
If you want to further reduce it to 1 row for that application, then you will need to remove the applicant names, or group them together into 1 single attribute making use of for example the STRING_AGG function.
As explained here: help-with-sql-query-how-to-retrieve-nam ... nt-s-9499
PATSTAT Support Team
EPO - Vienna
patstat @ epo.org


AthimetC
Posts: 2
Joined: Tue Feb 16, 2021 5:39 am

Re: Duplicate Application ID

Post by AthimetC » Wed Feb 17, 2021 7:58 am

Dear Support Team

Thank you for your help. Using SELECT DISTINCT is really helpful and reduce a lot of data.

Best

Athimet


Post Reply