Page 1 of 1

PATSTAT Online: use of STRING_AGG(...) and aliases problem

Posted: Tue Jul 14, 2020 1:11 pm
by EPO / PATSTAT Support
Many PATSTAT Online users ask how to create "lists" that "group" data which is split over a number of different tables. PATSTAT is a relational data base, and -as you might expect- most of the data is what we call "normalised" in order to reduce data redundancy and improve data integrity. A disadvantage is that when users "JOIN" tables, the result list will be a list of "normalised data items", but from a point of "human readability" it feels unnecessary to have the data split into many rows, when our brain can perfectly understand groups of data.
To solve the problem the T-SQL STRING_AGG function was created which takes all expressions from rows and concatenates them into a single string. However, a problem of duplicate data occurs when the query joins a number of tables that are not included in the STRING_AGG function. The STRING_AGG function allows to sort the data items, but does not have a "distinct" function to de-duplicate. To solve that issue, we bring the STRING_AGG function in a sub query which will create a de-duplicated data attribute. PATSTAT Online has a problem with ALIAS renaming of sub queries, so you need to address the attributes in the SELECT clause without the reference to table created by the sub query. (it's a little bug in the third-party parser).
The code: here is a simple query that gives the data on 6 EP applications according to a number of criteria. It results in 172 unique rows of data, giving all the applicants, inventors and CPC codes.

Code: Select all

SELECT distinct a.appln_id,a.appln_auth ,a.appln_nr, a.earliest_filing_date, a.nb_applicants,a.nb_inventors, a.docdb_family_id, 
a.docdb_family_size, applt_seq_nr, invt_seq_nr,
psn_name, cpc_class_symbol
FROM tls201_appln a
join tls207_pers_appln pa on a.appln_id = pa.appln_id
join tls206_person p on pa.person_id = p.person_id 
join tls224_appln_cpc CPC on a.appln_id = cpc.appln_id
WHERE (a.appln_id in (select appln_id from tls224_appln_cpc where left(cpc_class_symbol,8) = 'Y02T  10'))
and  a.earliest_filing_date  = '2003-11-19'
and a.appln_auth = 'EP'
ORDER BY a.appln_id
And this is the query when the same data is "grouped", joining the applicants, inventors and CPC codes in 3 seperate columns - resulting in 6 rows of data.

Code: Select all

SELECT DISTINCT a.appln_id,a.appln_auth ,a.appln_nr, a.earliest_filing_date, a.nb_applicants,a.nb_inventors, a.docdb_family_id, 
a.docdb_family_size, 
APPLICANT_GROUP,
INVENTOR_GROUP,
CPC_GROUP
FROM tls201_appln a
left join (select pa.appln_id ,STRING_AGG (cast((psn_name+' ['+person_ctry_code+']'+' ['+psn_sector+']' )as NVARCHAR(MAX)), '¦ ') APPLICANT_GROUP
FROM  tls207_pers_appln pa join tls206_person pap on pa.person_id = pap.person_id where pa.applt_seq_nr > 0  group by pa.appln_id) applicants
ON a.appln_id = applicants.appln_id
left join (select pa.appln_id ,STRING_AGG (cast((psn_name+' ['+person_ctry_code+']'+' ['+psn_sector+']' )as NVARCHAR(MAX)), '¦ ') INVENTOR_GROUP
FROM  tls207_pers_appln pa join tls206_person pap on pa.person_id = pap.person_id where pa.invt_seq_nr > 0 group by pa.appln_id) inventors
on a.appln_id = inventors.appln_id
join (select appln_id ,STRING_AGG (cast((cpc_class_symbol)as NVARCHAR(MAX)), '¦ ') WITHIN GROUP (ORDER BY cpc_class_symbol asc) CPC_GROUP
FROM  tls224_appln_cpc group by appln_id) CPC
ON a.appln_id = CPC.appln_id
where (a.appln_id in (select appln_id from tls224_appln_cpc where left(cpc_class_symbol,8) = 'Y02T  10'))
and  a.earliest_filing_date  = '2003-11-19'
and a.appln_auth = 'EP'
ORDER BY a.appln_auth,a.earliest_filing_date
(User having a local PATSTAT Global installation with WRITE access will rather use temporary tables or views instead of the above sub-query approach.)