The database design is far from optimum but I have to deal with it and now I'm really stuck.
Edit: I'm using cx_Oracle
OK so this is my query:
query="select degree, spectraldev.event.eventnumber \
from spectraldev.degree \
join spectraldev.alignment on \
(spectraldev.version_id = alignment.version_id) \
join spectraldev.event on \
(alignment.timestamp between event.eventstart and event.eventstop) \
join spectraldev.eventsetup on \
(spectraldev.event.eventsetup = spectraldev.eventsetup.oid) \
where spectraldev.event.eventnumber>=" + options.start + " AND spectraldev.event.eventnumber<=" + options.stop + " AND \
HITS>=" + str(options.minimum_hits)+" \
order by spectraldev.event.eventnumber"
db_cursor.execute(query)
Which returns a bunch of degree
s (12.34 etc.) for many events, which are identified by a unique number (eventnumber
like 346554).
So I get a table like this:
454544 45.2
454544 12.56
454544 41.1
454544 45.4
454600 22.3
454600 24.13
454600 21.32
454600 22.53
454600 54.51
454600 33.87
454610 32.7
454610 12.99
And so on…
Now I need to create a dictionary with the average degree for each event (summing up all corresponding floats and dividing by the number of them).
I think this could be done in SQL but I just can't get it work. At the moment I'm using python to do this, but the fetch command takes 1-2 hours to complete about 2000 Events, which is far too slow, since I need to process about 1000000 events.
This is my fetching part, which takes so long:
_degrees = []
for degree, eventNumber in cursor.fetchall():
_degrees.append([eventNumber, degree])
and then sorting (which is really fast, < 1sec) and calculating averages (also really fast):
_d={}
for eventNumber, degree in _degrees:
_d.setdefault(eventNumber, []).append(degree)
for event in events:
_curDegree = _degrees[int(event)]
_meanDegree = sum(_curDegree) / float(len(_curDegree))
meanDegrees.append(_meanDegree)
Is there a way to do the python part in SQL?
This is an aside, but an important one. You're wide open to SQL Injection. It may not matter in your particular instance but it's best to always code for the worst.
You don't mention what module you're using but assuming it's something that's PEP 249 compliant (you're probably using cx_Oracle) then you can pass a dictionary with the named bind parameters. A typical query might look like this:
On your actual query you're converting some variables (
options.start
for instance) to a string in Python, but not quoting them in SQL, which means they're being implicitly converted back to a number. This almost definitely isn't needed.In relation to your actual problem 1-2 hours to complete 2,000 events is, you're correct, ridiculous. You haven't posted a schema but my guess is you're lacking any indexes.
To get the average number of degrees per event number you should use the
avg()
function. This would make your query:I've formatted your query to make it slightly more readable (from my point of view) and to make it more obvious where you need the indexes.
Judging by this you need an index on the following tables and columns;
eventnumber
,eventstart
,eventstop
,eventsetup
version_id
version_id
,tstamp
oid
and wherever
hits
might be.Having said all that your problem may be the indexes. You haven't provided your explain plan or the schema, or the number of rows so this is going to be a guess. However, if you're selecting a significant proportion of the rows in a table the CBO may be using the indexes when it shouldn't. Forcing a full table scan using the full hint,
/*+ full(event) */
for instance, may solve your problem.Removing the
order by
, if it's not required may also significantly speed up your query.