Query returning duplicate results

Here you can post your opinions, ask questions and share experiences on the PATSTAT product line. Please always indicate the PATSTAT edition (e.g. 2015 Autumn Edition) and the database (e.g. PATSTAT Online, MySQL, MS SQL Server, ...) you are using.
Post Reply

jdickins90
Posts: 1
Joined: Tue Jun 07, 2022 9:37 am

Query returning duplicate results

Post by jdickins90 » Tue Jun 07, 2022 9:56 am

Hi, I'm trying to understand why a query I'm running is returning duplicates. I'm trying to return a list of EP publications with publication ID, address, and NUTS code with the following query:

Code: Select all

USE SELECT tls211_pat_publn.pat_publn_id, p.person_name, p.person_address, p.nuts
FROM tls206_person p
JOIN tls227_pers_publn on p.person_id = tls227_pers_publn.person_id
JOIN tls211_pat_publn on tls227_pers_publn.pat_publn_id = tls211_pat_publn.pat_publn_id
JOIN tls207_pers_appln on tls227_pers_publn.person_id = tls207_pers_appln.person_id
JOIN tls224_appln_cpc on tls224_appln_cpc.appln_id = tls207_pers_appln.appln_id
JOIN (select nuts, nuts_label, nuts_level from tls904_nuts) nuts on nuts.nuts = p.nuts
    WHERE publn_auth = 'ep'
        AND publn_kind = 'B1' 
	AND YEAR(publn_date) = 2019
	AND p.nuts LIKE 'UK%'
	AND tls224_appln_cpc.cpc_class_symbol LIKE 'C%'
The query returns a list with duplicate publication IDs. My question is in what circumstances would there be duplicate publication IDs in the tls211 table? Is it due to multiple persons on each publication? I'm ideally looking for a list of unique publications so any advice would be appreciated.


EPO / PATSTAT Support
Posts: 433
Joined: Thu Feb 22, 2007 5:33 pm
Contact:

Re: Query returning duplicate results

Post by EPO / PATSTAT Support » Wed Jun 15, 2022 4:44 pm

When joining tables in relational databases, the data base basically "de-normalizes" the data which results into duplicates.
An EP-B1 publication (what you used) is indeed a unique publication. But if you link it to the CPC table, you will get a new row for each CPC code assigned to that application. The same for applicants, more applicants means more rows. So 1 publication with 2 CPC codes and 3 applicants creates 6 rows.
To solve this you can use "DISTINCT" or "GROUP BY" to remove the duplicates (as long as you don't SELECT the names and the CPC codes).
So your query would become something like this:

Code: Select all

SELECT distinct tls211_pat_publn.pat_publn_id, p.person_name, p.person_address, p.nuts
FROM tls206_person p
JOIN tls227_pers_publn on p.person_id = tls227_pers_publn.person_id
JOIN tls211_pat_publn on tls227_pers_publn.pat_publn_id = tls211_pat_publn.pat_publn_id
JOIN tls207_pers_appln on tls227_pers_publn.person_id = tls207_pers_appln.person_id
JOIN tls224_appln_cpc on tls224_appln_cpc.appln_id = tls207_pers_appln.appln_id
JOIN (select nuts, nuts_label, nuts_level from tls904_nuts) nuts on nuts.nuts = p.nuts
    WHERE publn_auth = 'ep'
        AND publn_kind = 'B1' 
	AND YEAR(publn_date) = 2019
	AND p.nuts LIKE 'UK%'
	AND tls224_appln_cpc.cpc_class_symbol LIKE 'C%'
	order by tls211_pat_publn.pat_publn_id
You can see that your result is largely reduced and now only has distinct different rows. The duplicates are due to the fact that a publication is linked to multiple "persons: inventors and applicants" that are taken from the tls206_person table. Further reduction is not possible unless you create stricter conditions such as selecting only the applicants (excluding inventors). But that depends on what data you need for your research.
PATSTAT Support Team
EPO - Vienna
patstat @ epo.org


Post Reply