Skip to content Skip to sidebar Skip to footer

How To Select A Subset From A Multi-index Dataframe Based On Conditions From Another Dataframe

I have a dataframe like below: dates 0 numbers letters 0 a 2013-01-01 0.261092 2013-01-02 -1.267770

Solution 1:

You can also use index intersection:

In[39]: lOut[39]:
   numbersletters00b11cIn[40]: df.loc[df.index.intersection(l.set_index(['numbers','letters']).index)]
Out[40]:
                      dates0numbersletters0b2013-01-01-1.515866b2013-01-020.351942b2013-01-03-0.2454631c2013-01-01-0.108325c2013-01-02-0.212350c2013-01-030.021097c2013-01-01-0.419775c2013-01-021.511700c2013-01-030.994332

Or more straightforward and faster solution from @Javier:

In [155]:df.loc[l.set_index(['numbers','letters']).index]Out[155]:dates0numbersletters0b2013-01-01 -1.515866b2013-01-02  0.351942b2013-01-03 -0.2454631c2013-01-01 -0.108325c2013-01-02 -0.212350c2013-01-03  0.021097c2013-01-01 -0.419775c2013-01-02  1.511700c2013-01-03  0.994332

Timing:

for 27.000 rows Multi-Index DF

In [156]: df = pd.concat([df.reset_index()] * 10**3, ignore_index=True).set_index(['numbers','letters'])

In [157]: df.shape
Out[157]: (27000, 2)

In [158]: %%timeit
     ...: q = l.apply(lambda r: "(numbers == {} and letters == '{}')".format(r.numbers, r.letters),
     ...:             axis=1) \
     ...:      .str.cat(sep=' or ')
     ...: df.query(q)
     ...:
10 loops, best of 3: 21.3 ms per loop

In [159]: %%timeit
     ...: df.loc[l.set_index(['numbers','letters']).index]
     ...:
10 loops, best of 3: 20.2 ms per loop

In [160]: %%timeit
     ...: df.loc[df.index.intersection(l.set_index(['numbers','letters']).index)]
     ...:
10 loops, best of 3: 27.2 ms per loop

for 270.000 rows Multi-Index DF

In [163]: %%timeit
     ...: q = l.apply(lambda r: "(numbers == {} and letters == '{}')".format(r.numbers, r.letters),
     ...:             axis=1) \
     ...:      .str.cat(sep=' or ')
     ...: df.query(q)
     ...:
10 loops, best of 3: 117 ms per loop

In [164]: %%timeit
     ...: df.loc[l.set_index(['numbers','letters']).index]
     ...:
1 loop, best of 3: 142 ms per loop

In [165]: %%timeit
     ...: df.loc[df.index.intersection(l.set_index(['numbers','letters']).index)]
     ...:
10 loops, best of 3: 185 ms per loop

Conclusion:df.query() method which uses numexpr module internaly seems to be faster for bigger DFs

Solution 2:

assuming you have the following DF with the values that you want to get:

In[28]: lOut[28]:
   numbersletters00b11c

if you need to select all rows where numbers is either 0 or 1 and letters is in ['b','c'] you can use df.query() method as follows:

In [29]:df.query("numbersin@l.numbersandlettersin@l.letters")Out[29]:dates0numbersletters0b2013-01-01 -1.515866b2013-01-02  0.351942b2013-01-03 -0.245463c2013-01-01 -0.253103c2013-01-02 -0.385411c2013-01-03 -1.7408211c2013-01-01 -0.108325c2013-01-02 -0.212350c2013-01-03  0.021097b2013-01-01 -1.922214b2013-01-02 -1.769003b2013-01-03 -0.594216c2013-01-01 -0.419775c2013-01-02  1.511700c2013-01-03  0.994332

or simply:

df.query("numbers in [0,1] and letters in ['b','c']")

UPDATE: if it must be exact matches, like (0, 'b') and (1, 'c'):

In [14]:q=l.apply(lambdar:"(numbers == {} and letters == '{}')".format(r.numbers,r.letters),...:axis=1)\...:.str.cat(sep='or')
    ...:

In [15]: q
Out[15]: "(numbers == 0 and letters == 'b')or(numbers==1andletters=='c')"In [16]:df.query(q)Out[16]:dates0numbersletters0b2013-01-01 -1.515866b2013-01-02  0.351942b2013-01-03 -0.2454631c2013-01-01 -0.108325c2013-01-02 -0.212350c2013-01-03  0.021097c2013-01-01 -0.419775c2013-01-02  1.511700c2013-01-03  0.994332

Post a Comment for "How To Select A Subset From A Multi-index Dataframe Based On Conditions From Another Dataframe"