My pyROOT analysis code is using huge amounts of memory. I have reduced the problem to the example code below:
from ROOT import TChain, TH1D
# Load file, chain
chain = TChain("someChain")
inFile = "someFile.root"
chain.Add(inFile)
nentries = chain.GetEntries()
# Declare histograms
h_nTracks = TH1D("h_nTracks", "h_nTracks", 16, -0.5, 15.5)
h_E = TH1D("h_E","h_E",100,-0.1,6.0)
h_p = TH1D("h_p", "h_p", 100, -0.1, 6.0)
h_ECLEnergy = TH1D("h_ECLEnergy","h_ECLEnergy",100,-0.1,14.0)
# Loop over entries
for jentry in range(nentries):
# Load entry
entry = chain.GetEntry(jentry)
# Define variables
cands = chain.__ncandidates__
nTracks = chain.nTracks
E = chain.useCMSFrame__boE__bc
p = chain.useCMSFrame__bop__bc
ECLEnergy = chain.useCMSFrame__boECLEnergy__bc
# Fill histos
h_nTracks.Fill(nTracks)
h_ECLEnergy.Fill(ECLEnergy)
for cand in range(cands):
h_E.Fill(E[cand])
h_p.Fill(p[cand])
where someFile.root is a root file with 700,000 entries and multiple particle candidates per entry.
When I run this script it uses ~600 MB of memory. If I remove the line
h_p.Fill(p[cand])
it uses ~400 MB.
If I also remove the line
h_E.Fill(E[cand])
it uses ~150 MB.
If I also remove the lines
h_nTracks.Fill(nTracks)
h_ECLEnergy.Fill(ECLEnergy)
there is no further reduction in memory usage.
It seems that for every extra histogram that I fill of the form
h_variable.Fill(variable[cand])
(i.e. histograms that are filled once per candidate per entry, as opposed to histograms that are just filled once per entry) I use an extra ~200 MB of memory. This becomes a serious problem when I have 10 or more histograms because I am using GBs of memory and I am exceeding the limits of my computing system. Does anybody have a solution?
Update: I think this is a python3 problem.
If I take the script in my original post (above) and run it using python2 the memory usage is ~200 MB, compared to ~600 MB with python3. Even if I try to replicate Problem 2 by using the long variable names, the job still only uses ~200 MB of memory with python2, compared to ~1.3 GB with python3.
During my Googling I came across a few other accounts of people encountering memory leaks when using pyROOT with python3. It seems this is still an issue as of Python 3.6.2 and ROOT 6.08/06, and that for the moment you must use python2 if you want to use pyROOT.
So, using python2 appears to be my "solution" for now, but it's not ideal. If anybody has any further information or suggestions I'd be grateful to hear from you!
I'm glad you figured out Python3 was the problem. But if you (or anyone) continues to have memory usage issues when working with histograms in the future, here are some potential solutions I hope you'll find helpful!
THnSparse
Use
THnSparse
--THnSparse
is an efficient multidimensional histogram that shows its strengths in histograms where only a small fraction of the total bins are filled. You can read more about it here.TTree
TTrees
are data structures in ROOT that are quite frankly glorified tables. However, they are highly optimized. ATTree
is composed ofbranches
andleaves
that contain data that, through ROOT, can be speedily and efficiently accessed. If you put your data into aTTree
first, and then read it into a histogram, I guarantee you will find lower memory usage and higher run times.Here is some example
TTree
code.And here's a resource where you can learn about how fantastic
TTree
s are:TTree ROOT documentation
For more reading, here is a discussion on speeding up ROOT historgram building on the CERN help forum:
Memory-conservative histograms for usage in DQ monitoring
Best of luck with your data analysis, and happy coding!