Skip to Main Content

Database Software

Announcement

For appeals, questions and feedback about Oracle Forums, please email oracle-forums-moderators_us@oracle.com. Technical questions should be asked in the appropriate category. Thank you!

OCFS2 Free space issue

783262Jul 12 2012 — edited Jul 12 2012
Hello,
I have an Oracle 11gR2 RAC on two Oracle Linux 5 nodes. For storage I use a couple of OCFS2 partitions.

I have deleted a database using dbca and also have deleted the remaining files from shared partitions. When I consulted the free space with df I was surprised to see that the partitions appear to have a lot of used space even that there are no files on them. I've seen a similar behavior on a gpfs partition on AIX. Even that the database was closed and deleted (and lsof doesn't return anything), and I don't see how some files could be still opened, I rebooted both nodes (complete shutdown and start after both were down).

Now I see this:

[root@node1 bin]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sdf1 4.0G 283M 3.7G 8% /d01
/dev/sde1 2.0G 283M 1.7G 15% /d02
/dev/sdj1 1.0G 91M 934M 9% /d03
[root@node1 bin]# cd /d01
[root@node1 d01]# ls -a
[root@node1 d01]#

How is it possible to have used space on those partitions even if they are empty?
How can I do some space management on OCFS2 partitions?


[root@node1 d01]# modinfo ocfs2
filename: /lib/modules/2.6.32-300.27.1.el5uek/kernel/fs/ocfs2/ocfs2.ko
license: GPL
author: Oracle
version: 1.6.3
description: OCFS2 1.6.3
srcversion: B8AF3035CE0F00B763D188F
depends: ocfs2_stackglue,ocfs2_nodemanager
vermagic: 2.6.32-300.27.1.el5uek SMP mod_unload modversions

Thanks.

Comments

Jonathan Lewis

The first thing that strikes me as odd is that the operation is "partition list iterator" - as the second table of a nested loop where the driving table supplies the partition key that should be "'partition list single" (KEY/KEY).

I'd also be a little suspicious of the side effects of (a) FBI,(b) index fast full scan.

I think I'd have to do a little model to check if autolist does something different from "fixed" list.

Do any partitions have multiple partition key values, or are they all supposed to hold just one partition key ?  (If the latter, why

Regards

Jonathan Lewis

AndrewSayer

Jonathan Lewis wrote:

The first thing that strikes me as odd is that the operation is "partition list iterator" - as the second table of a nested loop where the driving table supplies the partition key that should be "'partition list single" (KEY/KEY).

I'd also be a little suspicious of the side effects of (a) FBI,(b) index fast full scan.

I think I'd have to do a little model to check if autolist does something different from "fixed" list.

Do any partitions have multiple partition key values, or are they all supposed to hold just one partition key ? (If the latter, why

Regards

Jonathan Lewis

Thanks Jonathan,

I agree that "iterator" does look suspicious, but I believe this is normal. On my 18.3 home sandbox, using standard manual list partitioning:

CREATE TABLE "AS_LIST_TABLE"
  ("PARTITION_KEY_COL" VARCHAR2(8 CHAR) CONSTRAINT "AS_LIST_TABLE_NN01" NOT NULL ENABLE,
  "PK_COL" NUMBER(9,0) CONSTRAINT "AS_LIST_TABLE_NN02" NOT NULL ENABLE
  )
PARTITION BY LIST ("PARTITION_KEY_COL")
(PARTITION "PART_01"  VALUES ('PART_01') SEGMENT CREATION IMMEDIATE,
PARTITION "PART_02"  VALUES ('PART_02') SEGMENT CREATION IMMEDIATE
);

create table AS_KEYS
(PARTITION_KEY_COL VARCHAR2(8 CHAR));

insert into AS_KEYS values ('PART_01');
commit;
explain plan for
select * from as_list_table t where t.partition_key_col in (select k.partition_key_col from as_keys k);

Plan hash value: 976037524

----------------------------------------------------------------------------------------------------------
| Id  | Operation                | Name          | Rows  | Bytes | Cost (%CPU)| Time    | Pstart| Pstop |
----------------------------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT        |              |    1 |    49 |    5  (20)| 00:00:01 |      |      |
|  1 |  NESTED LOOPS            |              |    1 |    49 |    5  (20)| 00:00:01 |      |      |
|  2 |  SORT UNIQUE            |              |    1 |    18 |    3  (0)| 00:00:01 |      |      |
|  3 |    TABLE ACCESS FULL    | AS_KEYS      |    1 |    18 |    3  (0)| 00:00:01 |      |      |
|  4 |  PARTITION LIST ITERATOR|              |    1 |    31 |    1  (0)| 00:00:01 |  KEY |  KEY |
|*  5 |    TABLE ACCESS FULL    | AS_LIST_TABLE |    1 |    31 |    1  (0)| 00:00:01 |  KEY |  KEY |
----------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

  5 - filter("T"."PARTITION_KEY_COL"="K"."PARTITION_KEY_COL")

And you can see from the second monitor output that the plan isn't inherently bad, it's restricted to 1 partition (there's no data for one of the values in the GTT in this table).

All partitions are supposed to hold only one value of the partition key, this is mainly to keep things simple and became easy with automatic list partitioning (obviously we forgot the golden rule of new features) but we also get concurrency benefits by having the data physically separated.

I've gone through the DDL from dbms_metadata and confirmed there's no sharing of partitions. Even if values were to share partitions, the strangeness in the execution is going the other way around - it's reading more than one partition for one partition key value.

I agree, the FBI and index fast full scan are also candidates for the suspicion, but I would have expected if there's issues there it would be more possible to replicate. That's probably just based on gut feeling though.

GregV

Hi Andrew,

Is this problem happening with some partition keys only or can it happen wtih any of them? That is, does it affect several users?

Apart from more data read than necessary, you didn't say if it caused a performance problem but I guess this is the case. Have you tried rebuilding the indexes?

Jonathan Lewis

Andrew,

You're right about the iterator/single thing, of course; apologies.

One idea for a potential bug is adaptive NLJ/HJ.  Since you're on 12.2 you've probably got adaptive plans enabled. Maybe there's some glitch that allows Oracle to report parts of a nested loop join while executing part of the hash join.  In the absence of a Bloom filter being created the hash join code would be doing a full tablescan of every partition.  (This doesn't really work as an idea for me because there seem to be too many bits that have to be half-right to produce your result.)

Can you get an OUTLINE and PROJECTION for the bad plan. The projection might give some clue about why the comparison between the gtt and the main table. The outline might tell us that some funny parameter settings, or unlikely transformation steps, have appeared for the bad execution.

The NOTES section would also be useful since it will tell us whether the plan was using private GTT stats, whether or not it was adapative, and possibly a couple of other bits and pieces.

Does the query run sufficiently rarely that it would be safe to enable the CBO trace system wide for the guilty SQL_ID ?

How about enabling SQL_TRACE system wide (at the same time, maybe) for the SQL_ID ?

Regards

Jonathan Lewis

AndrewSayer

GregV wrote:

Hi Andrew,

Is this problem happening with some partition keys only or can it happen wtih any of them? That is, does it affect several users?

Apart from more data read than necessary, you didn't say if it caused a performance problem but I guess this is the case. Have you tried rebuilding the indexes?

Hi Greg,

The problem is happening from multiple users and doesn't seem to be dependent on what is populated in the GTT. It's hard to say 100%, but when this occurs, not a single execution completes without scanning every partition.

Yes, these queries are run by a process which will time them out after 10 minutes, the query is then retried from a new session. We have seen that instead of completing by 03:30 like normal, these processes don't finish until after 09:00.

Rebuilding the index is on my list of possible workarounds. I'm still struggling to replicate the problem on my dev environments so it's difficult to test anything.

AndrewSayer

Jonathan Lewis wrote:

Andrew,

You're right about the iterator/single thing, of course; apologies.

One idea for a potential bug is adaptive NLJ/HJ. Since you're on 12.2 you've probably got adaptive plans enabled. Maybe there's some glitch that allows Oracle to report parts of a nested loop join while executing part of the hash join. In the absence of a Bloom filter being created the hash join code would be doing a full tablescan of every partition. (This doesn't really work as an idea for me because there seem to be too many bits that have to be half-right to produce your result.)

Can you get an OUTLINE and PROJECTION for the bad plan. The projection might give some clue about why the comparison between the gtt and the main table. The outline might tell us that some funny parameter settings, or unlikely transformation steps, have appeared for the bad execution.

The NOTES section would also be useful since it will tell us whether the plan was using private GTT stats, whether or not it was adapative, and possibly a couple of other bits and pieces.

Does the query run sufficiently rarely that it would be safe to enable the CBO trace system wide for the guilty SQL_ID ?

How about enabling SQL_TRACE system wide (at the same time, maybe) for the SQL_ID ?

Regards

Jonathan Lewis

Good thinking on the adaptive joins but we've still got some workarounds from 12.1 enabled. The bad plan is also the good plan, I've checked the advanced formatted plan and all plans are identical. There's no notes in the plan, and nothing in the /other_xml/info from other_xml other than the standard (db_version, parse_scheam, plan_hash,plan_hash_2, plan_hash_full - PHV_2 = PHV_full).

On going through this data to sanitize it, I've realized the index in question isn't actually an FBI - the real DDL is:

CREATE INDEX AS_DATE_INDEX ON AS_AUTOLIST_TABLE (DATE_COL,DATA_OWNER_KEY) LOCAL COMPRESS 2;

Sorry for that!

Outline Data

-------------

  /*+

      BEGIN_OUTLINE_DATA

      IGNORE_OPTIM_EMBEDDED_HINTS

      OPTIMIZER_FEATURES_ENABLE('12.2.0.1')

      DB_VERSION('12.2.0.1')

      OPT_PARAM('_optimizer_use_feedback' 'false')

      OPT_PARAM('_px_adaptive_dist_method' 'off')

      OPT_PARAM('_optimizer_strans_adaptive_pruning' 'false')

      OPT_PARAM('_optimizer_nlj_hj_adaptive_join' 'false')

      OPT_PARAM('optimizer_index_cost_adj' 50)

      ALL_ROWS

      USE_INVISIBLE_INDEXES

      OUTLINE_LEAF(@"SEL$5BA8D2B3")

      MERGE(@"SEL$385088EC" >"SEL$B6A7E33B")

      OUTLINE(@"SEL$B6A7E33B")

      ELIM_GROUPBY(@"SEL$5C160134")

      OUTLINE(@"SEL$385088EC")

      UNNEST(@"SEL$5")

      OUTLINE(@"SEL$5C160134")

      MERGE(@"SEL$335DD26A" >"SEL$1")

      OUTLINE(@"SEL$4")

      OUTLINE(@"SEL$5")

      OUTLINE(@"SEL$1")

      OUTLINE(@"SEL$335DD26A")

      MERGE(@"SEL$3" >"SEL$2")

      OUTLINE(@"SEL$2")

      OUTLINE(@"SEL$3")

      FULL(@"SEL$5BA8D2B3" "C"@"SEL$5")

      INDEX_FFS(@"SEL$5BA8D2B3" "L"@"SEL$4" ("AS_AUTOLIST_TABLE"."DATE_COL"

              "AS_AUTOLIST_TABLE"."DATA_OWNER_KEY"))

      LEADING(@"SEL$5BA8D2B3" "C"@"SEL$5" "L"@"SEL$4")

      USE_NL(@"SEL$5BA8D2B3" "L"@"SEL$4")

      USE_HASH_AGGREGATION(@"SEL$5BA8D2B3")

      SEMI_TO_INNER(@"SEL$5BA8D2B3" "C"@"SEL$5")

      END_OUTLINE_DATA

  */

Predicate Information (identified by operation id):

---------------------------------------------------

  6 - filter("C"."DATA_OWNER_KEY"="L"."DATA_OWNER_KEY")

Column Projection Information (identified by operation id):

-----------------------------------------------------------

  1 - TRUNC(INTERNAL_FUNCTION("L"."DATE_COL"),'fmdd')[8], COUNT(*)[22]

  2 - "L"."DATE_COL"[DATE,7]

  3 - (#keys=1) "C"."DATA_OWNER_KEY"[VARCHAR2,128]

  4 - (rowset=256) "C"."DATA_OWNER_KEY"[VARCHAR2,128]

  5 - "L"."DATE_COL"[DATE,7]

  6 - "L"."DATE_COL"[DATE,7]

Does the query run sufficiently rarely that it would be safe to enable the CBO trace system wide for the guilty SQL_ID ?

How about enabling SQL_TRACE system wide (at the same time, maybe) for the SQL_ID ?

I should be able to set this up on dev once we've managed to replicate it. It runs from many sessions at once in production which could cause a bit of an issue if we traced them. I might be able to do something with dbms_monitor so only one user is traced, unfortunately it only naturally gets executed (and hits the problem) in the middle of the night. I've also seen the 10128 event might be useful to trace for this but I think the only way to enable that for another session is with oradebug at runtime.

GregV

Yes, it gets very frustrating when you can't reproduce the problem. I see the execution plan goes for a nested loop probably because of the unique index. Does it make any difference if you force a semi join instead?

AndrewSayer

GregV wrote:

Yes, it gets very frustrating when you can't reproduce the problem. I see the execution plan goes for a nested loop probably because of the unique index. Does it make any difference if you force a semi join instead?

Do you mean force a hash join semi? Hinting at use_hash, it uses a partition list all with no bloom filter.

SQL_ID  7zcav7gdmarfu, child number 0

-------------------------------------

select /*+USE_HASH(@"SEL$5BA8D2B3" "L"@"SEL$4")*/date_col_day,

row_count from AS_AUTOLIST_TAB_AGG_VW

Plan hash value: 309570198

-------------------------------------------------------------------------------------------------------------------------------

| Id  | Operation              | Name                        | Rows  | Bytes |TempSpc| Cost (%CPU)| Time    | Pstart| Pstop |

-------------------------------------------------------------------------------------------------------------------------------

|  0 | SELECT STATEMENT        |                            |      |      |      |  440K(100)|          |      |      |

|  1 |  HASH GROUP BY          |                            |  3890K|    89M|  119M|  440K  (3)| 00:00:18 |      |      |

|*  2 |  HASH JOIN RIGHT SEMI  |                            |  3890K|    89M|      |  413K  (4)| 00:00:17 |      |      |

|  3 |    TABLE ACCESS FULL    | AS_PARTITION_KEY_GTT        |    2 |    18 |      |    2  (0)| 00:00:01 |      |      |

|  4 |    PARTITION LIST ALL  |                            |  1118M|    15G|      |  407K  (2)| 00:00:16 |    1 |  581 |

|  5 |    INDEX FAST FULL SCAN| AS_DATE_INDEX              |  1118M|    15G|      |  407K  (2)| 00:00:16 |    1 |  581 |

-------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

  2 - access("C"."DATA_OWNER_KEY"="L"."DATA_OWNER_KEY")

We are going to roll a change to the index so that it leads on the partition key (so if it does end up forgetting to partition prune, it can still use this as an access predicate) and stores the trunc(date_col) expression instead of date_col and compressed which should help reduce the amount necessary to read anyway.

Because we only managed to hit the issue in production during the quiet hours of the morning, I'm not sure I'm going to be able to come up with a method to replicate it which will likely mean the SR I've raised will go nowhere. It might be a race condition we are hitting as we can only see it when many sessions are executing the query at once, but this probably isn't the only ingredient.

Jonathan Lewis

Andrew,

I've created your tables (no date, gtt with just one column), and set the OPT_PARAM hints to get a plan, then I've edited the outline to use_hash() no_swap_join_Inputs() where the default plan was a nested loop.

Two things I note:

a) You have a SEMI join - despite a hint in the outline that says semi_to_inner

b) You have a SEL$5 in your outline - I only get up to sel$4.

My semi_to_inner works and I get a Bloom filter.

There must be something (maybe a VPD predicate) that disappears because it defaults null or 1=1 that is introducing another query block in your code. Have you tried a 10053 trace yet ?

The concurrency thing reminds me of a problem I came across several years (and versions) ago where code was doing exchange subpartition while other code was querying a single partition.  If the exchange took place in the tiny gap between the parse call and the exec call of the query the query would re-optimise on the exec and (possibly because we were depending on partition-wise joins) produce a plan that visited every partition and subpartition (and crash when it hit the one that was being exchanged).  Maybe something similar is happening here.

If you can enable system-wide tracing for the query you could check the tkprof files for "Misses in library cache during exec".

Are you seeing lots of separate child cursors for the query ?

Is this query executed so often that it parent cursors are obsoleted ? (unlikely in 12.2 with its default of 8192 child cursors).

Are the end-users using flashback queries on the view ?

Regards

Jonathan Lewis

Mohamed Houri

Jonathan Lewis wrote:

a) You have a SEMI join - despite a hint in the outline that says semi_to_inner

My semi_to_inner works and I get a Bloom filter.

@Jonathan

Nice observation.

In the Andrew NESTED LOOPS plan there is a SORT UNIQUE operation that eliminates duplicate rows from the outer table which can make Oracle confidently using an inner join in place of a semi join. In his HASH JOIN execution plan there is no such ‘’remove duplicate’’ operation from the outer table
or anything else like a unique index on
the join column of the outer table that allows a switch from a SEMI to INNER join.

Do you have a SORT UNIQUE operation in your execution plan?

@Andrew

I was going to suggest you tracing the SQL_ID as this might show recursive calls done at the FUNCTION BASED index,  but I see that the index is not an FBI index and that the issue is not easily reproducible.

Best regards

Mohamed Houri

Jonathan Lewis

There's a sort unique which, as you would expect, I can eliminate by adding a unique constraint to the column.

But that doesn't appear to be relevant - I can kick things around until I get a hash join right semi and I still get a Bloom filter.

@andrew - what does the definition of the GTT look like ? It's one thing we haven't seen yet.

Regards

Jonathan Lewis

AndrewSayer

Jonathan Lewis wrote:

Andrew,

I've created your tables (no date, gtt with just one column), and set the OPT_PARAM hints to get a plan, then I've edited the outline to use_hash() no_swap_join_Inputs() where the default plan was a nested loop.

Two things I note:

a) You have a SEMI join - despite a hint in the outline that says semi_to_inner

Thanks Jonathan,

The plan I see being used contains NESTED LOOPS rather than NESTED LOOPS SEMI, or are you referring to something else?

- Ah, you might have meant in the hash join plan, I didn't check the outline for this plan, all I added was the use_hash hint. Checking the outline now, there's no semi_to_inner there.

b) You have a SEL$5 in your outline - I only get up to sel$4.

Ah! There is one further view on top of the aggregate view - it's just doing select cols from view. I Should have realised I was simplifying out the additional query block for the sake of sharing. There's no VPD on either of the views, it's using the contents of the GTT to police that (it's owned by a different schema).

My semi_to_inner works and I get a Bloom filter.

I can get a bloom filter if I hint at a full table scan. Is your bloom filter plan also using an index fast full scan?

There must be something (maybe a VPD predicate) that disappears because it defaults null or 1=1 that is introducing another query block in your code. Have you tried a 10053 trace yet ?

I've got a 10053 trace from a dev environment, the same expected plan comes up, I can't see anything specific about how it's going to prune partitions when it executes (other than the KEY KEY in the final plan). I'll see if I can get a trace from production when the problem occurs again and see if there's any differences.

The concurrency thing reminds me of a problem I came across several years (and versions) ago where code was doing exchange subpartition while other code was querying a single partition.  If the exchange took place in the tiny gap between the parse call and the exec call of the query the query would re-optimise on the exec and (possibly because we were depending on partition-wise joins) produce a plan that visited every partition and subpartition (and crash when it hit the one that was being exchanged).  Maybe something similar is happening here.

That sort of situation is something I'm considering too but can't find any evidence for. The table is RO when it gets queried (it's populated by an ETL job much earlier), I've checked the SYS partition clean up jobs and none are active during this time, and the other active sessions (according to v$ash) look innocent enough (although I'm relying on sampled data). Having a look at the date columns in dba_objects, nothing did DDL to the table/index or partitions at the critical point.

If you can enable system-wide tracing for the query you could check the tkprof files for "Misses in library cache during exec".

Are you seeing lots of separate child cursors for the query ?

Is this query executed so often that it parent cursors are obsoleted ? (unlikely in 12.2 with its default of 8192 child cursors).

Are the end-users using flashback queries on the view ?

All executions have a child number of 0 according to v$active_session_history , there's also a child number 2 (according to dbms_xplan.display_cursor with null child number) but I didn't see any further detail than that. I'll  grab v$sql_shared_cursor in the next problem period. There's no flashback query in play here. I'll try enabling trace on the sql_id to see what comes up.

The GTT is very basic:

  CREATE GLOBAL TEMPORARY TABLE "AS_PARTITION_KEY_GTT"

   (    "DATA_OWNER_KEY" VARCHAR2(32 CHAR)

   ) ON COMMIT PRESERVE ROWS

(no indexes / constraints)

I notice that the length of the column is longer in the GTT than in the partitioned table (8 CHAR). I'm not sure that should affect how the bloom filter could appear for a full tablescan but not a fast full index scan.

Jonathan Lewis

Will investigate further this evening, but

a) I get semi_to_inner for both the NL and HJ (which means the sort unique appears)

b) I get the Bloom filter (and semi_to_inner) with the hash join when the partition table access is tablescan

c) When I hint the partition table access by index or index_ffs (date_col, data_owner_key not the FBI definiton) the Bloom filter disappears (semi_to_inner remains).

Regards

Jonathan Lewis

Jonathan Lewis

Andrew,

Sorry - no news.

The key points remain as:

Why do I see semi_to_inner / "sort unique" when you don't.  I see this whether I do a hash join or NLJ

Why do I see a Bloom filter when I force a hash join with full tablescan, but see the Bloom filter disappear if I force hash join with index fast full scan  or index full scan - is it an accident that it appears for the tablescan (the 10053 seems to say it won't!) or an accident that makes it disappear for the index options.

In fact, on a simple model - a basic join between the two tables (with a PK added to the gtt, and the gtt recreated so it's NOT a GTT) I can see a partition-join Bloom filter if I use a tablescan, and find that it disappears if I hint an index_ffs or index

Regards

Jonathan Lewis

AndrewSayer

Jonathan Lewis wrote:

Andrew,

Sorry - no news.

The key points remain as:

Why do I see semi_to_inner / "sort unique" when you don't. I see this whether I do a hash join or NLJ

Why do I see a Bloom filter when I force a hash join with full tablescan, but see the Bloom filter disappear if I force hash join with index fast full scan or index full scan - is it an accident that it appears for the tablescan (the 10053 seems to say it won't!) or an accident that makes it disappear for the index options.

In fact, on a simple model - a basic join between the two tables (with a PK added to the gtt, and the gtt recreated so it's NOT a GTT) I can see a partition-join Bloom filter if I use a tablescan, and find that it disappears if I hint an index_ffs or index

Regards

Jonathan Lewis

I do see the sort unique when it transforms the semi to inner join, I don't see it when I get the semi join. I also don't see the sort unique when the GTT has a primary key constraint.

Interestingly, I can't seem to put together a demo of bloom pruning against a partitioned index, or bloom filtering on an index. I've done some googling and can't see any mention that it's not allowed but also can't see any execution plans where it has been used. If the adaptive NLJ/HJ was possible then that would almost certainly be the problem - I'll have a search to make sure there's no issues with _optimizer_nlj_hj_adaptive_join (which is set as a result of optimizer_adaptive_plans=false), as this does seem a bit suspicious.

Unfortunately, still no traces for when the problem occurs. I'll continue to try to replicate the issue - a couple ideas that I've not investigated yet are direct path loads before the execution and gathering statistics (or setting them) during execution.

I did manage to do a 10128 trace but it didn't reveal how it decided which partitions to use - although it did list all partitions being involved inside the parse phase.

Dom Brooks

Andy,
Did you get anywhere with this?
Application has just upgraded to 19.6 and I've got a problem post-upgrade with PARTITION LIST ITERATOR reading too many partitions. I'm still analysing and theorising but this was the only decent search result I could find to indicate that maybe I wasn't the only one who had found this.

1 - 16
Locked Post
New comments cannot be posted to this locked post.

Post Details

Locked on Aug 9 2012
Added on Jul 12 2012
2 comments
206 views