Hello Richard,
table TLS214_NPL_PUBLN can be linked to TLS212_CITATION via the attribute NPL_PUBLN_ID.
Here is a query that extracts all citation information (except the citation categories) for a single patent application.
Code: Select all
SELECT citing.appln_id, citing.publn_auth, citing.publn_nr, citing.publn_kind, citing.publn_date, tls212_citation.*,
       cited.publn_auth, cited.publn_nr, cited.publn_kind, cited.publn_date, tls214_npl_publn.*
FROM tls211_pat_publn citing
JOIN tls212_citation ON citing.pat_publn_id = tls212_citation.pat_publn_id
JOIN tls211_pat_publn cited ON tls212_citation.cited_pat_publn_id = cited.pat_publn_id
JOIN tls214_npl_publn ON tls212_citation.cited_npl_publn_id = tls214_npl_publn.npl_publn_id
JOIN tls201_appln citing_app ON citing.appln_id = citing_app.appln_id
WHERE citing_app.appln_nr_epodoc = 'WO2002US35880'
-- citing.publn_kind = 'a1' AND citing.publn_auth = 'ep' AND citing.appln_id = 16011356
ORDER BY citing.publn_date DESC, citing.appln_id
Your query will generate an enormous amount of data because you have hardly any restrictions and a large number of joined tables. Each table that you join in a query (mostly) multiplies the number of records. You are trying to re-create a data set, but that data set will be de-normalised (because of all the joins), and it becomes in principle unworkable because of its size. For example, joining the TLS209_APPLN_IPC table creates a new record (row in your result) for each existing row times the number of IPC classification codes of that application.
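To see this multiplication effect before running the big extraction, you can count how many IPC rows each application would contribute. This is a diagnostic sketch of my own, not part of your query:

```sql
-- Diagnostic sketch: each application joins to this many IPC rows,
-- so a join to TLS209_APPLN_IPC multiplies the result set by nb_ipc_rows
SELECT appln_id, COUNT(*) AS nb_ipc_rows
FROM tls209_appln_ipc
GROUP BY appln_id
ORDER BY nb_ipc_rows DESC;
```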
You can avoid this by using the STRING_AGG function, but the question is whether you really need all those IPC codes in your final data set. The same question goes for the names of the applicants and the inventors: do you really need the name of every single inventor and applicant? (Each of them multiplies your data set.)
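As a minimal sketch, STRING_AGG collapses the one-to-many IPC rows into a single column per application, so the join no longer multiplies records (syntax as in SQL Server 2017+ and PostgreSQL; the alias ipc_symbols is my own choice):

```sql
-- Sketch: one row per application, with all IPC codes in a single string
SELECT a.appln_id,
       STRING_AGG(i.ipc_class_symbol, ', ') AS ipc_symbols
FROM tls201_appln a
JOIN tls209_appln_ipc i ON a.appln_id = i.appln_id
GROUP BY a.appln_id;
```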
The table TLS226_PERSON_ORIG is probably not needed in your query; the address is already in TLS206_PERSON.
The table TLS215_CITN_CATEG has to be linked to the cited publications, not to the citing publication.
Here is a reworked/corrected query; to illustrate the above, I run it for one single application. It returns 36 records!
If you run this query for all applications where "appln_filing_year BETWEEN 1998 AND 2018" (54 million applications), you can see that the amount of data will spiral out of control.
Code: Select all
SELECT a.appln_filing_year, a.appln_id, a.appln_auth, a.appln_nr, a.earliest_publn_year, a.granted, a.nb_citing_docdb_fam,
       r.appln_title,
       t.pat_publn_id, t.publn_auth, t.publn_date, t.publn_first_grant, t.publn_nr,
       s.citn_origin, s.citn_replenished, s.npl_citn_seq_nr, s.pat_citn_seq_nr,
       s.cited_appln_id,
       s.cited_pat_publn_id,
       b.citn_categ, b.citn_id, b.citn_replenished, b.pat_publn_id,
       STRING_AGG(i.ipc_class_symbol, ', ') AS IPC_symbols,
       q.psn_name, q.psn_id, c.ctry_code
FROM tls201_appln a
JOIN tls209_appln_ipc i ON a.appln_id = i.appln_id
JOIN tls202_appln_title r ON a.appln_id = r.appln_id
JOIN tls211_pat_publn t ON a.appln_id = t.appln_id
JOIN tls212_citation s ON t.pat_publn_id = s.pat_publn_id
JOIN tls215_citn_categ b ON s.pat_publn_id = b.pat_publn_id AND s.citn_replenished = b.citn_replenished AND s.citn_id = b.citn_id
JOIN tls207_pers_appln o ON a.appln_id = o.appln_id
JOIN tls206_person q ON o.person_id = q.person_id
JOIN tls801_country c ON c.ctry_code = q.person_ctry_code
WHERE a.appln_id = 848
GROUP BY a.appln_filing_year, a.appln_id, a.appln_auth, a.appln_nr, a.earliest_publn_year, a.granted, a.nb_citing_docdb_fam,
         r.appln_title,
         t.pat_publn_id, t.publn_auth, t.publn_date, t.publn_first_grant, t.publn_nr,
         s.citn_origin, s.citn_replenished, s.npl_citn_seq_nr, s.pat_citn_seq_nr,
         s.cited_appln_id,
         s.cited_pat_publn_id,
         q.psn_name, q.psn_id, c.ctry_code,
         b.citn_categ, b.citn_id, b.citn_replenished, b.pat_publn_id
ORDER BY a.appln_filing_year, a.appln_id, t.pat_publn_id, s.pat_citn_seq_nr
Solution:
You have to go back to the drawing board and decide what data you really need for your research. Most probably, you should try to do any "counts" directly in the main database instead of extracting an enormous data set for further aggregation.
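For example, a citation count per filing year can be computed entirely in the database, so only the yearly totals come back instead of millions of detail rows. This is a rough sketch under the assumption that citation counts by year are what you are after:

```sql
-- Sketch: aggregate in the database, return only one row per filing year
SELECT a.appln_filing_year, COUNT(*) AS nb_citations
FROM tls201_appln a
JOIN tls211_pat_publn t ON a.appln_id = t.appln_id
JOIN tls212_citation s ON t.pat_publn_id = s.pat_publn_id
WHERE a.appln_filing_year BETWEEN 1998 AND 2018
GROUP BY a.appln_filing_year
ORDER BY a.appln_filing_year;
```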