Age Owner Branch data TLA Line data Source code
1 : : /*-------------------------------------------------------------------------
2 : : *
3 : : * tuplesort.c
4 : : * Generalized tuple sorting routines.
5 : : *
6 : : * This module provides a generalized facility for tuple sorting, which can be
7 : : * applied to different kinds of sortable objects. Implementation of
8 : : * the particular sorting variants is given in tuplesortvariants.c.
9 : : * This module works efficiently for both small and large amounts
10 : : * of data. Small amounts are sorted in-memory using qsort(). Large
11 : : * amounts are sorted using temporary files and a standard external sort
12 : : * algorithm.
13 : : *
14 : : * See Knuth, volume 3, for more than you want to know about external
15 : : * sorting algorithms. The algorithm we use is a balanced k-way merge.
16 : : * Before PostgreSQL 15, we used the polyphase merge algorithm (Knuth's
17 : : * Algorithm 5.4.2D), but with modern hardware, a straightforward balanced
18 : : * merge is better. Knuth is assuming that tape drives are expensive
19 : : * beasts, and in particular that there will always be many more runs than
20 : : * tape drives. The polyphase merge algorithm was good at keeping all the
21 : : * tape drives busy, but in our implementation a "tape drive" doesn't cost
22 : : * much more than a few Kb of memory buffers, so we can afford to have
23 : : * lots of them. In particular, if we can have as many tape drives as
24 : : * sorted runs, we can eliminate any repeated I/O at all.
25 : : *
26 : : * Historically, we divided the input into sorted runs using replacement
27 : : * selection, in the form of a priority tree implemented as a heap
28 : : * (essentially Knuth's Algorithm 5.2.3H), but now we always use quicksort
29 : : * for run generation.
30 : : *
31 : : * The approximate amount of memory allowed for any one sort operation
32 : : * is specified in kilobytes by the caller (most pass work_mem). Initially,
33 : : * we absorb tuples and simply store them in an unsorted array as long as
34 : : * we haven't exceeded workMem. If we reach the end of the input without
35 : : * exceeding workMem, we sort the array using qsort() and subsequently return
36 : : * tuples just by scanning the tuple array sequentially. If we do exceed
37 : : * workMem, we begin to emit tuples into sorted runs in temporary tapes.
38 : : * When tuples are dumped in batch after quicksorting, we begin a new run
39 : : * with a new output tape. If we reach the max number of tapes, we write
40 : : * subsequent runs on the existing tapes in a round-robin fashion. We will
41 : : * need multiple merge passes to finish the merge in that case. After the
42 : : * end of the input is reached, we dump out remaining tuples in memory into
43 : : * a final run, then merge the runs.
44 : : *
45 : : * When merging runs, we use a heap containing just the frontmost tuple from
46 : : * each source run; we repeatedly output the smallest tuple and replace it
47 : : * with the next tuple from its source tape (if any). When the heap empties,
48 : : * the merge is complete. The basic merge algorithm thus needs very little
49 : : * memory --- only M tuples for an M-way merge, and M is constrained to a
50 : : * small number. However, we can still make good use of our full workMem
51 : : * allocation by pre-reading additional blocks from each source tape. Without
52 : : * prereading, our access pattern to the temporary file would be very erratic;
53 : : * on average we'd read one block from each of M source tapes during the same
54 : : * time that we're writing M blocks to the output tape, so there is no
55 : : * sequentiality of access at all, defeating the read-ahead methods used by
56 : : * most Unix kernels. Worse, the output tape gets written into a very random
57 : : * sequence of blocks of the temp file, ensuring that things will be even
58 : : * worse when it comes time to read that tape. A straightforward merge pass
59 : : * thus ends up doing a lot of waiting for disk seeks. We can improve matters
60 : : * by prereading from each source tape sequentially, loading about workMem/M
61 : : * bytes from each tape in turn, and making the sequential blocks immediately
62 : : * available for reuse. This approach helps to localize both read and write
63 : : * accesses. The pre-reading is handled by logtape.c, we just tell it how
64 : : * much memory to use for the buffers.
65 : : *
66 : : * In the current code we determine the number of input tapes M on the basis
67 : : * of workMem: we want workMem/M to be large enough that we read a fair
68 : : * amount of data each time we read from a tape, so as to maintain the
69 : : * locality of access described above. Nonetheless, with large workMem we
70 : : * can have many tapes. The logical "tapes" are implemented by logtape.c,
71 : : * which avoids space wastage by recycling disk space as soon as each block
72 : : * is read from its "tape".
73 : : *
74 : : * When the caller requests random access to the sort result, we form
75 : : * the final sorted run on a logical tape which is then "frozen", so
76 : : * that we can access it randomly. When the caller does not need random
77 : : * access, we return from tuplesort_performsort() as soon as we are down
78 : : * to one run per logical tape. The final merge is then performed
79 : : * on-the-fly as the caller repeatedly calls tuplesort_getXXX; this
80 : : * saves one cycle of writing all the data out to disk and reading it in.
81 : : *
82 : : * This module supports parallel sorting. Parallel sorts involve coordination
83 : : * among one or more worker processes, and a leader process, each with its own
84 : : * tuplesort state. The leader process (or, more accurately, the
85 : : * Tuplesortstate associated with a leader process) creates a full tapeset
86 : : * consisting of worker tapes with one run to merge; a run for every
87 : : * worker process. This is then merged. Worker processes are guaranteed to
88 : : * produce exactly one output run from their partial input.
89 : : *
90 : : *
91 : : * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
92 : : * Portions Copyright (c) 1994, Regents of the University of California
93 : : *
94 : : * IDENTIFICATION
95 : : * src/backend/utils/sort/tuplesort.c
96 : : *
97 : : *-------------------------------------------------------------------------
98 : : */
99 : :
100 : : #include "postgres.h"
101 : :
102 : : #include <limits.h>
103 : :
104 : : #include "commands/tablespace.h"
105 : : #include "miscadmin.h"
106 : : #include "pg_trace.h"
107 : : #include "storage/shmem.h"
108 : : #include "utils/memutils.h"
109 : : #include "utils/pg_rusage.h"
110 : : #include "utils/tuplesort.h"
111 : :
112 : : /*
113 : : * Initial size of memtuples array. We're trying to select this size so that
114 : : * array doesn't exceed ALLOCSET_SEPARATE_THRESHOLD and so that the overhead of
115 : : * allocation might possibly be lowered. However, we don't consider array sizes
116 : : * less than 1024.
117 : : *
118 : : */
119 : : #define INITIAL_MEMTUPSIZE Max(1024, \
120 : : ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
121 : :
122 : : /* GUC variables */
123 : : #ifdef TRACE_SORT
124 : : bool trace_sort = false;
125 : : #endif
126 : :
127 : : #ifdef DEBUG_BOUNDED_SORT
128 : : bool optimize_bounded_sort = true;
129 : : #endif
130 : :
131 : :
132 : : /*
133 : : * During merge, we use a pre-allocated set of fixed-size slots to hold
134 : : * tuples. To avoid palloc/pfree overhead.
135 : : *
136 : : * Merge doesn't require a lot of memory, so we can afford to waste some,
137 : : * by using gratuitously-sized slots. If a tuple is larger than 1 kB, the
138 : : * palloc() overhead is not significant anymore.
139 : : *
140 : : * 'nextfree' is valid when this chunk is in the free list. When in use, the
141 : : * slot holds a tuple.
142 : : */
143 : : #define SLAB_SLOT_SIZE 1024
144 : :
145 : : typedef union SlabSlot
146 : : {
147 : : union SlabSlot *nextfree;
148 : : char buffer[SLAB_SLOT_SIZE];
149 : : } SlabSlot;
150 : :
151 : : /*
152 : : * Possible states of a Tuplesort object. These denote the states that
153 : : * persist between calls of Tuplesort routines.
154 : : */
155 : : typedef enum
156 : : {
157 : : TSS_INITIAL, /* Loading tuples; still within memory limit */
158 : : TSS_BOUNDED, /* Loading tuples into bounded-size heap */
159 : : TSS_BUILDRUNS, /* Loading tuples; writing to tape */
160 : : TSS_SORTEDINMEM, /* Sort completed entirely in memory */
161 : : TSS_SORTEDONTAPE, /* Sort completed, final run is on tape */
162 : : TSS_FINALMERGE, /* Performing final merge on-the-fly */
163 : : } TupSortStatus;
164 : :
165 : : /*
166 : : * Parameters for calculation of number of tapes to use --- see inittapes()
167 : : * and tuplesort_merge_order().
168 : : *
169 : : * In this calculation we assume that each tape will cost us about 1 blocks
170 : : * worth of buffer space. This ignores the overhead of all the other data
171 : : * structures needed for each tape, but it's probably close enough.
172 : : *
173 : : * MERGE_BUFFER_SIZE is how much buffer space we'd like to allocate for each
174 : : * input tape, for pre-reading (see discussion at top of file). This is *in
175 : : * addition to* the 1 block already included in TAPE_BUFFER_OVERHEAD.
176 : : */
177 : : #define MINORDER 6 /* minimum merge order */
178 : : #define MAXORDER 500 /* maximum merge order */
179 : : #define TAPE_BUFFER_OVERHEAD BLCKSZ
180 : : #define MERGE_BUFFER_SIZE (BLCKSZ * 32)
181 : :
182 : :
183 : : /*
184 : : * Private state of a Tuplesort operation.
185 : : */
186 : : struct Tuplesortstate
187 : : {
188 : : TuplesortPublic base;
189 : : TupSortStatus status; /* enumerated value as shown above */
190 : : bool bounded; /* did caller specify a maximum number of
191 : : * tuples to return? */
192 : : bool boundUsed; /* true if we made use of a bounded heap */
193 : : int bound; /* if bounded, the maximum number of tuples */
194 : : int64 tupleMem; /* memory consumed by individual tuples.
195 : : * storing this separately from what we track
196 : : * in availMem allows us to subtract the
197 : : * memory consumed by all tuples when dumping
198 : : * tuples to tape */
199 : : int64 availMem; /* remaining memory available, in bytes */
200 : : int64 allowedMem; /* total memory allowed, in bytes */
201 : : int maxTapes; /* max number of input tapes to merge in each
202 : : * pass */
203 : : int64 maxSpace; /* maximum amount of space occupied among sort
204 : : * of groups, either in-memory or on-disk */
205 : : bool isMaxSpaceDisk; /* true when maxSpace is value for on-disk
206 : : * space, false when it's value for in-memory
207 : : * space */
208 : : TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
209 : : LogicalTapeSet *tapeset; /* logtape.c object for tapes in a temp file */
210 : :
211 : : /*
212 : : * This array holds the tuples now in sort memory. If we are in state
213 : : * INITIAL, the tuples are in no particular order; if we are in state
214 : : * SORTEDINMEM, the tuples are in final sorted order; in states BUILDRUNS
215 : : * and FINALMERGE, the tuples are organized in "heap" order per Algorithm
216 : : * H. In state SORTEDONTAPE, the array is not used.
217 : : */
218 : : SortTuple *memtuples; /* array of SortTuple structs */
219 : : int memtupcount; /* number of tuples currently present */
220 : : int memtupsize; /* allocated length of memtuples array */
221 : : bool growmemtuples; /* memtuples' growth still underway? */
222 : :
223 : : /*
224 : : * Memory for tuples is sometimes allocated using a simple slab allocator,
225 : : * rather than with palloc(). Currently, we switch to slab allocation
226 : : * when we start merging. Merging only needs to keep a small, fixed
227 : : * number of tuples in memory at any time, so we can avoid the
228 : : * palloc/pfree overhead by recycling a fixed number of fixed-size slots
229 : : * to hold the tuples.
230 : : *
231 : : * For the slab, we use one large allocation, divided into SLAB_SLOT_SIZE
232 : : * slots. The allocation is sized to have one slot per tape, plus one
233 : : * additional slot. We need that many slots to hold all the tuples kept
234 : : * in the heap during merge, plus the one we have last returned from the
235 : : * sort, with tuplesort_gettuple.
236 : : *
237 : : * Initially, all the slots are kept in a linked list of free slots. When
238 : : * a tuple is read from a tape, it is put to the next available slot, if
239 : : * it fits. If the tuple is larger than SLAB_SLOT_SIZE, it is palloc'd
240 : : * instead.
241 : : *
242 : : * When we're done processing a tuple, we return the slot back to the free
243 : : * list, or pfree() if it was palloc'd. We know that a tuple was
244 : : * allocated from the slab, if its pointer value is between
245 : : * slabMemoryBegin and -End.
246 : : *
247 : : * When the slab allocator is used, the USEMEM/LACKMEM mechanism of
248 : : * tracking memory usage is not used.
249 : : */
250 : : bool slabAllocatorUsed;
251 : :
252 : : char *slabMemoryBegin; /* beginning of slab memory arena */
253 : : char *slabMemoryEnd; /* end of slab memory arena */
254 : : SlabSlot *slabFreeHead; /* head of free list */
255 : :
256 : : /* Memory used for input and output tape buffers. */
257 : : size_t tape_buffer_mem;
258 : :
259 : : /*
260 : : * When we return a tuple to the caller in tuplesort_gettuple_XXX, that
261 : : * came from a tape (that is, in TSS_SORTEDONTAPE or TSS_FINALMERGE
262 : : * modes), we remember the tuple in 'lastReturnedTuple', so that we can
263 : : * recycle the memory on next gettuple call.
264 : : */
265 : : void *lastReturnedTuple;
266 : :
267 : : /*
268 : : * While building initial runs, this is the current output run number.
269 : : * Afterwards, it is the number of initial runs we made.
270 : : */
271 : : int currentRun;
272 : :
273 : : /*
274 : : * Logical tapes, for merging.
275 : : *
276 : : * The initial runs are written in the output tapes. In each merge pass,
277 : : * the output tapes of the previous pass become the input tapes, and new
278 : : * output tapes are created as needed. When nInputTapes equals
279 : : * nInputRuns, there is only one merge pass left.
280 : : */
281 : : LogicalTape **inputTapes;
282 : : int nInputTapes;
283 : : int nInputRuns;
284 : :
285 : : LogicalTape **outputTapes;
286 : : int nOutputTapes;
287 : : int nOutputRuns;
288 : :
289 : : LogicalTape *destTape; /* current output tape */
290 : :
291 : : /*
292 : : * These variables are used after completion of sorting to keep track of
293 : : * the next tuple to return. (In the tape case, the tape's current read
294 : : * position is also critical state.)
295 : : */
296 : : LogicalTape *result_tape; /* actual tape of finished output */
297 : : int current; /* array index (only used if SORTEDINMEM) */
298 : : bool eof_reached; /* reached EOF (needed for cursors) */
299 : :
300 : : /* markpos_xxx holds marked position for mark and restore */
301 : : int64 markpos_block; /* tape block# (only used if SORTEDONTAPE) */
302 : : int markpos_offset; /* saved "current", or offset in tape block */
303 : : bool markpos_eof; /* saved "eof_reached" */
304 : :
305 : : /*
306 : : * These variables are used during parallel sorting.
307 : : *
308 : : * worker is our worker identifier. Follows the general convention that
309 : : * -1 value relates to a leader tuplesort, and values >= 0 worker
310 : : * tuplesorts. (-1 can also be a serial tuplesort.)
311 : : *
312 : : * shared is mutable shared memory state, which is used to coordinate
313 : : * parallel sorts.
314 : : *
315 : : * nParticipants is the number of worker Tuplesortstates known by the
316 : : * leader to have actually been launched, which implies that they must
317 : : * finish a run that the leader needs to merge. Typically includes a
318 : : * worker state held by the leader process itself. Set in the leader
319 : : * Tuplesortstate only.
320 : : */
321 : : int worker;
322 : : Sharedsort *shared;
323 : : int nParticipants;
324 : :
325 : : /*
326 : : * Additional state for managing "abbreviated key" sortsupport routines
327 : : * (which currently may be used by all cases except the hash index case).
328 : : * Tracks the intervals at which the optimization's effectiveness is
329 : : * tested.
330 : : */
331 : : int64 abbrevNext; /* Tuple # at which to next check
332 : : * applicability */
333 : :
334 : : /*
335 : : * Resource snapshot for time of sort start.
336 : : */
337 : : #ifdef TRACE_SORT
338 : : PGRUsage ru_start;
339 : : #endif
340 : : };
341 : :
342 : : /*
343 : : * Private mutable state of tuplesort-parallel-operation. This is allocated
344 : : * in shared memory.
345 : : */
346 : : struct Sharedsort
347 : : {
348 : : /* mutex protects all fields prior to tapes */
349 : : slock_t mutex;
350 : :
351 : : /*
352 : : * currentWorker generates ordinal identifier numbers for parallel sort
353 : : * workers. These start from 0, and are always gapless.
354 : : *
355 : : * Workers increment workersFinished to indicate having finished. If this
356 : : * is equal to state.nParticipants within the leader, leader is ready to
357 : : * merge worker runs.
358 : : */
359 : : int currentWorker;
360 : : int workersFinished;
361 : :
362 : : /* Temporary file space */
363 : : SharedFileSet fileset;
364 : :
365 : : /* Size of tapes flexible array */
366 : : int nTapes;
367 : :
368 : : /*
369 : : * Tapes array used by workers to report back information needed by the
370 : : * leader to concatenate all worker tapes into one for merging
371 : : */
372 : : TapeShare tapes[FLEXIBLE_ARRAY_MEMBER];
373 : : };
374 : :
375 : : /*
376 : : * Is the given tuple allocated from the slab memory arena?
377 : : */
378 : : #define IS_SLAB_SLOT(state, tuple) \
379 : : ((char *) (tuple) >= (state)->slabMemoryBegin && \
380 : : (char *) (tuple) < (state)->slabMemoryEnd)
381 : :
382 : : /*
383 : : * Return the given tuple to the slab memory free list, or free it
384 : : * if it was palloc'd.
385 : : */
386 : : #define RELEASE_SLAB_SLOT(state, tuple) \
387 : : do { \
388 : : SlabSlot *buf = (SlabSlot *) tuple; \
389 : : \
390 : : if (IS_SLAB_SLOT((state), buf)) \
391 : : { \
392 : : buf->nextfree = (state)->slabFreeHead; \
393 : : (state)->slabFreeHead = buf; \
394 : : } else \
395 : : pfree(buf); \
396 : : } while(0)
397 : :
398 : : #define REMOVEABBREV(state,stup,count) ((*(state)->base.removeabbrev) (state, stup, count))
399 : : #define COMPARETUP(state,a,b) ((*(state)->base.comparetup) (a, b, state))
400 : : #define WRITETUP(state,tape,stup) ((*(state)->base.writetup) (state, tape, stup))
401 : : #define READTUP(state,stup,tape,len) ((*(state)->base.readtup) (state, stup, tape, len))
402 : : #define FREESTATE(state) ((state)->base.freestate ? (*(state)->base.freestate) (state) : (void) 0)
403 : : #define LACKMEM(state) ((state)->availMem < 0 && !(state)->slabAllocatorUsed)
404 : : #define USEMEM(state,amt) ((state)->availMem -= (amt))
405 : : #define FREEMEM(state,amt) ((state)->availMem += (amt))
406 : : #define SERIAL(state) ((state)->shared == NULL)
407 : : #define WORKER(state) ((state)->shared && (state)->worker != -1)
408 : : #define LEADER(state) ((state)->shared && (state)->worker == -1)
409 : :
410 : : /*
411 : : * NOTES about on-tape representation of tuples:
412 : : *
413 : : * We require the first "unsigned int" of a stored tuple to be the total size
414 : : * on-tape of the tuple, including itself (so it is never zero; an all-zero
415 : : * unsigned int is used to delimit runs). The remainder of the stored tuple
416 : : * may or may not match the in-memory representation of the tuple ---
417 : : * any conversion needed is the job of the writetup and readtup routines.
418 : : *
419 : : * If state->sortopt contains TUPLESORT_RANDOMACCESS, then the stored
420 : : * representation of the tuple must be followed by another "unsigned int" that
421 : : * is a copy of the length --- so the total tape space used is actually
422 : : * sizeof(unsigned int) more than the stored length value. This allows
423 : : * read-backwards. When the random access flag was not specified, the
424 : : * write/read routines may omit the extra length word.
425 : : *
426 : : * writetup is expected to write both length words as well as the tuple
427 : : * data. When readtup is called, the tape is positioned just after the
428 : : * front length word; readtup must read the tuple data and advance past
429 : : * the back length word (if present).
430 : : *
431 : : * The write/read routines can make use of the tuple description data
432 : : * stored in the Tuplesortstate record, if needed. They are also expected
433 : : * to adjust state->availMem by the amount of memory space (not tape space!)
434 : : * released or consumed. There is no error return from either writetup
435 : : * or readtup; they should ereport() on failure.
436 : : *
437 : : *
438 : : * NOTES about memory consumption calculations:
439 : : *
440 : : * We count space allocated for tuples against the workMem limit, plus
441 : : * the space used by the variable-size memtuples array. Fixed-size space
442 : : * is not counted; it's small enough to not be interesting.
443 : : *
444 : : * Note that we count actual space used (as shown by GetMemoryChunkSpace)
445 : : * rather than the originally-requested size. This is important since
446 : : * palloc can add substantial overhead. It's not a complete answer since
447 : : * we won't count any wasted space in palloc allocation blocks, but it's
448 : : * a lot better than what we were doing before 7.3. As of 9.6, a
449 : : * separate memory context is used for caller passed tuples. Resetting
450 : : * it at certain key increments significantly ameliorates fragmentation.
451 : : * readtup routines use the slab allocator (they cannot use
452 : : * the reset context because it gets deleted at the point that merging
453 : : * begins).
454 : : */
455 : :
456 : :
457 : : static void tuplesort_begin_batch(Tuplesortstate *state);
458 : : static bool consider_abort_common(Tuplesortstate *state);
459 : : static void inittapes(Tuplesortstate *state, bool mergeruns);
460 : : static void inittapestate(Tuplesortstate *state, int maxTapes);
461 : : static void selectnewtape(Tuplesortstate *state);
462 : : static void init_slab_allocator(Tuplesortstate *state, int numSlots);
463 : : static void mergeruns(Tuplesortstate *state);
464 : : static void mergeonerun(Tuplesortstate *state);
465 : : static void beginmerge(Tuplesortstate *state);
466 : : static bool mergereadnext(Tuplesortstate *state, LogicalTape *srcTape, SortTuple *stup);
467 : : static void dumptuples(Tuplesortstate *state, bool alltuples);
468 : : static void make_bounded_heap(Tuplesortstate *state);
469 : : static void sort_bounded_heap(Tuplesortstate *state);
470 : : static void tuplesort_sort_memtuples(Tuplesortstate *state);
471 : : static void tuplesort_heap_insert(Tuplesortstate *state, SortTuple *tuple);
472 : : static void tuplesort_heap_replace_top(Tuplesortstate *state, SortTuple *tuple);
473 : : static void tuplesort_heap_delete_top(Tuplesortstate *state);
474 : : static void reversedirection(Tuplesortstate *state);
475 : : static unsigned int getlen(LogicalTape *tape, bool eofOK);
476 : : static void markrunend(LogicalTape *tape);
477 : : static int worker_get_identifier(Tuplesortstate *state);
478 : : static void worker_freeze_result_tape(Tuplesortstate *state);
479 : : static void worker_nomergeruns(Tuplesortstate *state);
480 : : static void leader_takeover_tapes(Tuplesortstate *state);
481 : : static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
482 : : static void tuplesort_free(Tuplesortstate *state);
483 : : static void tuplesort_updatemax(Tuplesortstate *state);
484 : :
485 : : /*
486 : : * Specialized comparators that we can inline into specialized sorts. The goal
487 : : * is to try to sort two tuples without having to follow the pointers to the
488 : : * comparator or the tuple.
489 : : *
490 : : * XXX: For now, there is no specialization for cases where datum1 is
491 : : * authoritative and we don't even need to fall back to a callback at all (that
492 : : * would be true for types like int4/int8/timestamp/date, but not true for
493 : : * abbreviations of text or multi-key sorts. There could be! Is it worth it?
494 : : */
495 : :
496 : : /* Used if first key's comparator is ssup_datum_unsigned_cmp */
497 : : static pg_attribute_always_inline int
743 john.naylor@postgres 498 :CBC 23184484 : qsort_tuple_unsigned_compare(SortTuple *a, SortTuple *b, Tuplesortstate *state)
499 : : {
500 : : int compare;
501 : :
502 : 23184484 : compare = ApplyUnsignedSortComparator(a->datum1, a->isnull1,
503 : 23184484 : b->datum1, b->isnull1,
504 : : &state->base.sortKeys[0]);
505 [ + + ]: 23184484 : if (compare != 0)
506 : 20921849 : return compare;
507 : :
508 : : /*
509 : : * No need to waste effort calling the tiebreak function when there are no
510 : : * other keys to sort on.
511 : : */
627 akorotkov@postgresql 512 [ - + ]: 2262635 : if (state->base.onlyKey != NULL)
723 drowley@postgresql.o 513 :UBC 0 : return 0;
514 : :
242 john.naylor@postgres 515 :GNC 2262635 : return state->base.comparetup_tiebreak(a, b, state);
516 : : }
517 : :
518 : : #if SIZEOF_DATUM >= 8
519 : : /* Used if first key's comparator is ssup_datum_signed_cmp */
520 : : static pg_attribute_always_inline int
743 john.naylor@postgres 521 :CBC 2815248 : qsort_tuple_signed_compare(SortTuple *a, SortTuple *b, Tuplesortstate *state)
522 : : {
523 : : int compare;
524 : :
525 : 2815248 : compare = ApplySignedSortComparator(a->datum1, a->isnull1,
526 : 2815248 : b->datum1, b->isnull1,
527 : : &state->base.sortKeys[0]);
528 : :
529 [ + + ]: 2815248 : if (compare != 0)
530 : 2809295 : return compare;
531 : :
532 : : /*
533 : : * No need to waste effort calling the tiebreak function when there are no
534 : : * other keys to sort on.
535 : : */
627 akorotkov@postgresql 536 [ + + ]: 5953 : if (state->base.onlyKey != NULL)
723 drowley@postgresql.o 537 : 551 : return 0;
538 : :
242 john.naylor@postgres 539 :GNC 5402 : return state->base.comparetup_tiebreak(a, b, state);
540 : : }
541 : : #endif
542 : :
543 : : /* Used if first key's comparator is ssup_datum_int32_cmp */
544 : : static pg_attribute_always_inline int
743 john.naylor@postgres 545 :CBC 26137569 : qsort_tuple_int32_compare(SortTuple *a, SortTuple *b, Tuplesortstate *state)
546 : : {
547 : : int compare;
548 : :
549 : 26137569 : compare = ApplyInt32SortComparator(a->datum1, a->isnull1,
703 tgl@sss.pgh.pa.us 550 : 26137569 : b->datum1, b->isnull1,
551 : : &state->base.sortKeys[0]);
552 : :
743 john.naylor@postgres 553 [ + + ]: 26137569 : if (compare != 0)
554 : 18594944 : return compare;
555 : :
556 : : /*
557 : : * No need to waste effort calling the tiebreak function when there are no
558 : : * other keys to sort on.
559 : : */
627 akorotkov@postgresql 560 [ + + ]: 7542625 : if (state->base.onlyKey != NULL)
723 drowley@postgresql.o 561 : 839674 : return 0;
562 : :
242 john.naylor@postgres 563 :GNC 6702951 : return state->base.comparetup_tiebreak(a, b, state);
564 : : }
565 : :
566 : : /*
567 : : * Special versions of qsort just for SortTuple objects. qsort_tuple() sorts
568 : : * any variant of SortTuples, using the appropriate comparetup function.
569 : : * qsort_ssup() is specialized for the case where the comparetup function
570 : : * reduces to ApplySortComparator(), that is single-key MinimalTuple sorts
571 : : * and Datum sorts. qsort_tuple_{unsigned,signed,int32} are specialized for
572 : : * common comparison functions on pass-by-value leading datums.
573 : : */
574 : :
575 : : #define ST_SORT qsort_tuple_unsigned
576 : : #define ST_ELEMENT_TYPE SortTuple
577 : : #define ST_COMPARE(a, b, state) qsort_tuple_unsigned_compare(a, b, state)
578 : : #define ST_COMPARE_ARG_TYPE Tuplesortstate
579 : : #define ST_CHECK_FOR_INTERRUPTS
580 : : #define ST_SCOPE static
581 : : #define ST_DEFINE
582 : : #include "lib/sort_template.h"
583 : :
584 : : #if SIZEOF_DATUM >= 8
585 : : #define ST_SORT qsort_tuple_signed
586 : : #define ST_ELEMENT_TYPE SortTuple
587 : : #define ST_COMPARE(a, b, state) qsort_tuple_signed_compare(a, b, state)
588 : : #define ST_COMPARE_ARG_TYPE Tuplesortstate
589 : : #define ST_CHECK_FOR_INTERRUPTS
590 : : #define ST_SCOPE static
591 : : #define ST_DEFINE
592 : : #include "lib/sort_template.h"
593 : : #endif
594 : :
595 : : #define ST_SORT qsort_tuple_int32
596 : : #define ST_ELEMENT_TYPE SortTuple
597 : : #define ST_COMPARE(a, b, state) qsort_tuple_int32_compare(a, b, state)
598 : : #define ST_COMPARE_ARG_TYPE Tuplesortstate
599 : : #define ST_CHECK_FOR_INTERRUPTS
600 : : #define ST_SCOPE static
601 : : #define ST_DEFINE
602 : : #include "lib/sort_template.h"
603 : :
604 : : #define ST_SORT qsort_tuple
605 : : #define ST_ELEMENT_TYPE SortTuple
606 : : #define ST_COMPARE_RUNTIME_POINTER
607 : : #define ST_COMPARE_ARG_TYPE Tuplesortstate
608 : : #define ST_CHECK_FOR_INTERRUPTS
609 : : #define ST_SCOPE static
610 : : #define ST_DECLARE
611 : : #define ST_DEFINE
612 : : #include "lib/sort_template.h"
613 : :
614 : : #define ST_SORT qsort_ssup
615 : : #define ST_ELEMENT_TYPE SortTuple
616 : : #define ST_COMPARE(a, b, ssup) \
617 : : ApplySortComparator((a)->datum1, (a)->isnull1, \
618 : : (b)->datum1, (b)->isnull1, (ssup))
619 : : #define ST_COMPARE_ARG_TYPE SortSupportData
620 : : #define ST_CHECK_FOR_INTERRUPTS
621 : : #define ST_SCOPE static
622 : : #define ST_DEFINE
623 : : #include "lib/sort_template.h"
624 : :
625 : : /*
626 : : * tuplesort_begin_xxx
627 : : *
628 : : * Initialize for a tuple sort operation.
629 : : *
630 : : * After calling tuplesort_begin, the caller should call tuplesort_putXXX
631 : : * zero or more times, then call tuplesort_performsort when all the tuples
632 : : * have been supplied. After performsort, retrieve the tuples in sorted
633 : : * order by calling tuplesort_getXXX until it returns false/NULL. (If random
634 : : * access was requested, rescan, markpos, and restorepos can also be called.)
635 : : * Call tuplesort_end to terminate the operation and release memory/disk space.
636 : : *
637 : : * Each variant of tuplesort_begin has a workMem parameter specifying the
638 : : * maximum number of kilobytes of RAM to use before spilling data to disk.
639 : : * (The normal value of this parameter is work_mem, but some callers use
640 : : * other values.) Each variant also has a sortopt which is a bitmask of
641 : : * sort options. See TUPLESORT_* definitions in tuplesort.h
642 : : */
643 : :
644 : : Tuplesortstate *
741 drowley@postgresql.o 645 :CBC 117392 : tuplesort_begin_common(int workMem, SortCoordinate coordinate, int sortopt)
646 : : {
647 : : Tuplesortstate *state;
648 : : MemoryContext maincontext;
649 : : MemoryContext sortcontext;
650 : : MemoryContext oldcontext;
651 : :
652 : : /* See leader_takeover_tapes() remarks on random access support */
653 [ + + - + ]: 117392 : if (coordinate && (sortopt & TUPLESORT_RANDOMACCESS))
2263 rhaas@postgresql.org 654 [ # # ]:UBC 0 : elog(ERROR, "random access disallowed under parallel sort");
655 : :
656 : : /*
657 : : * Memory context surviving tuplesort_reset. This memory context holds
658 : : * data which is useful to keep while sorting multiple similar batches.
659 : : */
1469 tomas.vondra@postgre 660 :CBC 117392 : maincontext = AllocSetContextCreate(CurrentMemoryContext,
661 : : "TupleSort main",
662 : : ALLOCSET_DEFAULT_SIZES);
663 : :
664 : : /*
665 : : * Create a working memory context for one sort operation. The content of
666 : : * this context is deleted by tuplesort_reset.
667 : : */
668 : 117392 : sortcontext = AllocSetContextCreate(maincontext,
669 : : "TupleSort sort",
670 : : ALLOCSET_DEFAULT_SIZES);
671 : :
672 : : /*
673 : : * Additionally a working memory context for tuples is setup in
674 : : * tuplesort_begin_batch.
675 : : */
676 : :
677 : : /*
678 : : * Make the Tuplesortstate within the per-sortstate context. This way, we
679 : : * don't need a separate pfree() operation for it at shutdown.
680 : : */
681 : 117392 : oldcontext = MemoryContextSwitchTo(maincontext);
682 : :
7823 bruce@momjian.us 683 : 117392 : state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
684 : :
685 : : #ifdef TRACE_SORT
6768 tgl@sss.pgh.pa.us 686 [ - + ]: 117392 : if (trace_sort)
6768 tgl@sss.pgh.pa.us 687 :UBC 0 : pg_rusage_init(&state->ru_start);
688 : : #endif
689 : :
627 akorotkov@postgresql 690 :CBC 117392 : state->base.sortopt = sortopt;
691 : 117392 : state->base.tuples = true;
692 : 117392 : state->abbrevNext = 10;
693 : :
694 : : /*
695 : : * workMem is forced to be at least 64KB, the current minimum valid value
696 : : * for the work_mem GUC. This is a defense against parallel sort callers
697 : : * that divide out memory among many workers in a way that leaves each
698 : : * with very little memory.
699 : : */
2263 rhaas@postgresql.org 700 : 117392 : state->allowedMem = Max(workMem, 64) * (int64) 1024;
627 akorotkov@postgresql 701 : 117392 : state->base.sortcontext = sortcontext;
702 : 117392 : state->base.maincontext = maincontext;
703 : :
704 : : /*
705 : : * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
706 : : * see comments in grow_memtuples().
707 : : */
1469 tomas.vondra@postgre 708 : 117392 : state->memtupsize = INITIAL_MEMTUPSIZE;
709 : 117392 : state->memtuples = NULL;
710 : :
711 : : /*
712 : : * After all of the other non-parallel-related state, we setup all of the
713 : : * state needed for each batch.
714 : : */
715 : 117392 : tuplesort_begin_batch(state);
716 : :
717 : : /*
718 : : * Initialize parallel-related state based on coordination information
719 : : * from caller
720 : : */
2263 rhaas@postgresql.org 721 [ + + ]: 117392 : if (!coordinate)
722 : : {
723 : : /* Serial sort */
724 : 117078 : state->shared = NULL;
725 : 117078 : state->worker = -1;
726 : 117078 : state->nParticipants = -1;
727 : : }
728 [ + + ]: 314 : else if (coordinate->isWorker)
729 : : {
730 : : /* Parallel worker produces exactly one final run from all input */
731 : 210 : state->shared = coordinate->sharedsort;
732 : 210 : state->worker = worker_get_identifier(state);
733 : 210 : state->nParticipants = -1;
734 : : }
735 : : else
736 : : {
737 : : /* Parallel leader state only used for final merge */
738 : 104 : state->shared = coordinate->sharedsort;
739 : 104 : state->worker = -1;
740 : 104 : state->nParticipants = coordinate->nParticipants;
741 [ - + ]: 104 : Assert(state->nParticipants >= 1);
742 : : }
743 : :
6622 tgl@sss.pgh.pa.us 744 : 117392 : MemoryContextSwitchTo(oldcontext);
745 : :
8946 746 : 117392 : return state;
747 : : }
748 : :
749 : : /*
750 : : * tuplesort_begin_batch
751 : : *
752 : : * Setup, or reset, all state need for processing a new set of tuples with this
753 : : * sort state. Called both from tuplesort_begin_common (the first time sorting
754 : : * with this sort state) and tuplesort_reset (for subsequent usages).
755 : : */
756 : : static void
1469 tomas.vondra@postgre 757 : 118704 : tuplesort_begin_batch(Tuplesortstate *state)
758 : : {
759 : : MemoryContext oldcontext;
760 : :
627 akorotkov@postgresql 761 : 118704 : oldcontext = MemoryContextSwitchTo(state->base.maincontext);
762 : :
763 : : /*
764 : : * Caller tuple (e.g. IndexTuple) memory context.
765 : : *
766 : : * A dedicated child context used exclusively for caller passed tuples
767 : : * eases memory management. Resetting at key points reduces
768 : : * fragmentation. Note that the memtuples array of SortTuples is allocated
769 : : * in the parent context, not this context, because there is no need to
770 : : * free memtuples early. For bounded sorts, tuples may be pfreed in any
771 : : * order, so we use a regular aset.c context so that it can make use of
772 : : * free'd memory. When the sort is not bounded, we make use of a bump.c
773 : : * context as this keeps allocations more compact with less wastage.
774 : : * Allocations are also slightly more CPU efficient.
775 : : */
6 drowley@postgresql.o 776 [ + + ]:GNC 118704 : if (TupleSortUseBumpTupleCxt(state->base.sortopt))
777 : 118039 : state->base.tuplecontext = BumpContextCreate(state->base.sortcontext,
778 : : "Caller tuples",
779 : : ALLOCSET_DEFAULT_SIZES);
780 : : else
627 akorotkov@postgresql 781 :CBC 665 : state->base.tuplecontext = AllocSetContextCreate(state->base.sortcontext,
782 : : "Caller tuples",
783 : : ALLOCSET_DEFAULT_SIZES);
784 : :
785 : :
1469 tomas.vondra@postgre 786 : 118704 : state->status = TSS_INITIAL;
787 : 118704 : state->bounded = false;
788 : 118704 : state->boundUsed = false;
789 : :
790 : 118704 : state->availMem = state->allowedMem;
791 : :
792 : 118704 : state->tapeset = NULL;
793 : :
794 : 118704 : state->memtupcount = 0;
795 : :
796 : : /*
797 : : * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
798 : : * see comments in grow_memtuples().
799 : : */
800 : 118704 : state->growmemtuples = true;
801 : 118704 : state->slabAllocatorUsed = false;
802 [ + + - + ]: 118704 : if (state->memtuples != NULL && state->memtupsize != INITIAL_MEMTUPSIZE)
803 : : {
1469 tomas.vondra@postgre 804 :UBC 0 : pfree(state->memtuples);
805 : 0 : state->memtuples = NULL;
806 : 0 : state->memtupsize = INITIAL_MEMTUPSIZE;
807 : : }
1469 tomas.vondra@postgre 808 [ + + ]:CBC 118704 : if (state->memtuples == NULL)
809 : : {
810 : 117392 : state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
811 : 117392 : USEMEM(state, GetMemoryChunkSpace(state->memtuples));
812 : : }
813 : :
814 : : /* workMem must be large enough for the minimal memtuples array */
815 [ - + - - ]: 118704 : if (LACKMEM(state))
1469 tomas.vondra@postgre 816 [ # # ]:UBC 0 : elog(ERROR, "insufficient memory allowed for sort");
817 : :
1469 tomas.vondra@postgre 818 :CBC 118704 : state->currentRun = 0;
819 : :
820 : : /*
821 : : * Tape variables (inputTapes, outputTapes, etc.) will be initialized by
822 : : * inittapes(), if needed.
823 : : */
824 : :
909 heikki.linnakangas@i 825 : 118704 : state->result_tape = NULL; /* flag that result tape has not been formed */
826 : :
1469 tomas.vondra@postgre 827 : 118704 : MemoryContextSwitchTo(oldcontext);
828 : 118704 : }
829 : :
830 : : /*
831 : : * tuplesort_set_bound
832 : : *
833 : : * Advise tuplesort that at most the first N result tuples are required.
834 : : *
835 : : * Must be called before inserting any tuples. (Actually, we could allow it
836 : : * as long as the sort hasn't spilled to disk, but there seems no need for
837 : : * delayed calls at the moment.)
838 : : *
839 : : * This is a hint only. The tuplesort may still return more tuples than
840 : : * requested. Parallel leader tuplesorts will always ignore the hint.
841 : : */
842 : : void
6190 tgl@sss.pgh.pa.us 843 : 598 : tuplesort_set_bound(Tuplesortstate *state, int64 bound)
844 : : {
845 : : /* Assert we're called before loading any tuples */
1676 alvherre@alvh.no-ip. 846 [ + - - + ]: 598 : Assert(state->status == TSS_INITIAL && state->memtupcount == 0);
847 : : /* Assert we allow bounded sorts */
627 akorotkov@postgresql 848 [ - + ]: 598 : Assert(state->base.sortopt & TUPLESORT_ALLOWBOUNDED);
849 : : /* Can't set the bound twice, either */
6190 tgl@sss.pgh.pa.us 850 [ - + ]: 598 : Assert(!state->bounded);
851 : : /* Also, this shouldn't be called in a parallel worker */
2263 rhaas@postgresql.org 852 [ - + - - ]: 598 : Assert(!WORKER(state));
853 : :
854 : : /* Parallel leader allows but ignores hint */
1675 tgl@sss.pgh.pa.us 855 [ - + - - ]: 598 : if (LEADER(state))
1675 tgl@sss.pgh.pa.us 856 :UBC 0 : return;
857 : :
858 : : #ifdef DEBUG_BOUNDED_SORT
859 : : /* Honor GUC setting that disables the feature (for easy testing) */
860 : : if (!optimize_bounded_sort)
861 : : return;
862 : : #endif
863 : :
864 : : /* We want to be able to compute bound * 2, so limit the setting */
5995 bruce@momjian.us 865 [ - + ]:CBC 598 : if (bound > (int64) (INT_MAX / 2))
6190 tgl@sss.pgh.pa.us 866 :UBC 0 : return;
867 : :
6190 tgl@sss.pgh.pa.us 868 :CBC 598 : state->bounded = true;
869 : 598 : state->bound = (int) bound;
870 : :
871 : : /*
872 : : * Bounded sorts are not an effective target for abbreviated key
873 : : * optimization. Disable by setting state to be consistent with no
874 : : * abbreviation support.
875 : : */
627 akorotkov@postgresql 876 : 598 : state->base.sortKeys->abbrev_converter = NULL;
877 [ + + ]: 598 : if (state->base.sortKeys->abbrev_full_comparator)
878 : 8 : state->base.sortKeys->comparator = state->base.sortKeys->abbrev_full_comparator;
879 : :
880 : : /* Not strictly necessary, but be tidy */
881 : 598 : state->base.sortKeys->abbrev_abort = NULL;
882 : 598 : state->base.sortKeys->abbrev_full_comparator = NULL;
883 : : }
884 : :
885 : : /*
886 : : * tuplesort_used_bound
887 : : *
888 : : * Allow callers to find out if the sort state was able to use a bound.
889 : : */
890 : : bool
1469 tomas.vondra@postgre 891 : 55 : tuplesort_used_bound(Tuplesortstate *state)
892 : : {
893 : 55 : return state->boundUsed;
894 : : }
895 : :
896 : : /*
897 : : * tuplesort_free
898 : : *
899 : : * Internal routine for freeing resources of tuplesort.
900 : : */
901 : : static void
902 : 118584 : tuplesort_free(Tuplesortstate *state)
903 : : {
904 : : /* context swap probably not needed, but let's be safe */
627 akorotkov@postgresql 905 : 118584 : MemoryContext oldcontext = MemoryContextSwitchTo(state->base.sortcontext);
906 : :
907 : : #ifdef TRACE_SORT
908 : : int64 spaceUsed;
909 : :
8946 tgl@sss.pgh.pa.us 910 [ + + ]: 118584 : if (state->tapeset)
6753 911 : 339 : spaceUsed = LogicalTapeSetBlocks(state->tapeset);
912 : : else
913 : 118245 : spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
914 : : #endif
915 : :
916 : : /*
917 : : * Delete temporary "tape" files, if any.
918 : : *
919 : : * Note: want to include this in reported total cost of sort, hence need
920 : : * for two #ifdef TRACE_SORT sections.
921 : : *
922 : : * We don't bother to destroy the individual tapes here. They will go away
923 : : * with the sortcontext. (In TSS_FINALMERGE state, we have closed
924 : : * finished tapes already.)
925 : : */
6622 926 [ + + ]: 118584 : if (state->tapeset)
927 : 339 : LogicalTapeSetClose(state->tapeset);
928 : :
929 : : #ifdef TRACE_SORT
6768 930 [ - + ]: 118584 : if (trace_sort)
931 : : {
6753 tgl@sss.pgh.pa.us 932 [ # # ]:UBC 0 : if (state->tapeset)
149 michael@paquier.xyz 933 [ # # # # ]:UNC 0 : elog(LOG, "%s of worker %d ended, %lld disk blocks used: %s",
934 : : SERIAL(state) ? "external sort" : "parallel external sort",
935 : : state->worker, (long long) spaceUsed, pg_rusage_show(&state->ru_start));
936 : : else
937 [ # # # # ]: 0 : elog(LOG, "%s of worker %d ended, %lld KB used: %s",
938 : : SERIAL(state) ? "internal sort" : "unperformed parallel sort",
939 : : state->worker, (long long) spaceUsed, pg_rusage_show(&state->ru_start));
940 : : }
941 : :
942 : : TRACE_POSTGRESQL_SORT_DONE(state->tapeset != NULL, spaceUsed);
943 : : #else
944 : :
945 : : /*
946 : : * If you disabled TRACE_SORT, you can still probe sort__done, but you
947 : : * ain't getting space-used stats.
948 : : */
949 : : TRACE_POSTGRESQL_SORT_DONE(state->tapeset != NULL, 0L);
950 : : #endif
951 : :
627 akorotkov@postgresql 952 [ + + ]:CBC 118584 : FREESTATE(state);
6622 tgl@sss.pgh.pa.us 953 : 118584 : MemoryContextSwitchTo(oldcontext);
954 : :
955 : : /*
956 : : * Free the per-sort memory context, thereby releasing all working memory.
957 : : */
627 akorotkov@postgresql 958 : 118584 : MemoryContextReset(state->base.sortcontext);
1469 tomas.vondra@postgre 959 : 118584 : }
960 : :
961 : : /*
962 : : * tuplesort_end
963 : : *
964 : : * Release resources and clean up.
965 : : *
966 : : * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
967 : : * pointing to garbage. Be careful not to attempt to use or free such
968 : : * pointers afterwards!
969 : : */
970 : : void
971 : 117272 : tuplesort_end(Tuplesortstate *state)
972 : : {
973 : 117272 : tuplesort_free(state);
974 : :
975 : : /*
976 : : * Free the main memory context, including the Tuplesortstate struct
977 : : * itself.
978 : : */
627 akorotkov@postgresql 979 : 117272 : MemoryContextDelete(state->base.maincontext);
1469 tomas.vondra@postgre 980 : 117272 : }
981 : :
982 : : /*
983 : : * tuplesort_updatemax
984 : : *
985 : : * Update maximum resource usage statistics.
986 : : */
987 : : static void
988 : 1504 : tuplesort_updatemax(Tuplesortstate *state)
989 : : {
990 : : int64 spaceUsed;
991 : : bool isSpaceDisk;
992 : :
993 : : /*
994 : : * Note: it might seem we should provide both memory and disk usage for a
995 : : * disk-based sort. However, the current code doesn't track memory space
996 : : * accurately once we have begun to return tuples to the caller (since we
997 : : * don't account for pfree's the caller is expected to do), so we cannot
998 : : * rely on availMem in a disk sort. This does not seem worth the overhead
999 : : * to fix. Is it worth creating an API for the memory context code to
1000 : : * tell us how much is actually used in sortcontext?
1001 : : */
1002 [ - + ]: 1504 : if (state->tapeset)
1003 : : {
1469 tomas.vondra@postgre 1004 :UBC 0 : isSpaceDisk = true;
1005 : 0 : spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
1006 : : }
1007 : : else
1008 : : {
1469 tomas.vondra@postgre 1009 :CBC 1504 : isSpaceDisk = false;
1010 : 1504 : spaceUsed = state->allowedMem - state->availMem;
1011 : : }
1012 : :
1013 : : /*
1014 : : * Sort evicts data to the disk when it wasn't able to fit that data into
1015 : : * main memory. This is why we assume space used on the disk to be more
1016 : : * important for tracking resource usage than space used in memory. Note
1017 : : * that the amount of space occupied by some tupleset on the disk might be
1018 : : * less than amount of space occupied by the same tupleset in memory due
1019 : : * to more compact representation.
1020 : : */
1021 [ - + - - ]: 1504 : if ((isSpaceDisk && !state->isMaxSpaceDisk) ||
1022 [ + - + + ]: 1504 : (isSpaceDisk == state->isMaxSpaceDisk && spaceUsed > state->maxSpace))
1023 : : {
1024 : 199 : state->maxSpace = spaceUsed;
1025 : 199 : state->isMaxSpaceDisk = isSpaceDisk;
1026 : 199 : state->maxSpaceStatus = state->status;
1027 : : }
1028 : 1504 : }
1029 : :
1030 : : /*
1031 : : * tuplesort_reset
1032 : : *
1033 : : * Reset the tuplesort. Reset all the data in the tuplesort, but leave the
1034 : : * meta-information in. After tuplesort_reset, tuplesort is ready to start
1035 : : * a new sort. This allows avoiding recreation of tuple sort states (and
1036 : : * save resources) when sorting multiple small batches.
1037 : : */
1038 : : void
1039 : 1312 : tuplesort_reset(Tuplesortstate *state)
1040 : : {
1041 : 1312 : tuplesort_updatemax(state);
1042 : 1312 : tuplesort_free(state);
1043 : :
1044 : : /*
1045 : : * After we've freed up per-batch memory, re-setup all of the state common
1046 : : * to both the first batch and any subsequent batch.
1047 : : */
1048 : 1312 : tuplesort_begin_batch(state);
1049 : :
1050 : 1312 : state->lastReturnedTuple = NULL;
1051 : 1312 : state->slabMemoryBegin = NULL;
1052 : 1312 : state->slabMemoryEnd = NULL;
1053 : 1312 : state->slabFreeHead = NULL;
6622 tgl@sss.pgh.pa.us 1054 : 1312 : }
1055 : :
1056 : : /*
1057 : : * Grow the memtuples[] array, if possible within our memory constraint. We
1058 : : * must not exceed INT_MAX tuples in memory or the caller-provided memory
1059 : : * limit. Return true if we were able to enlarge the array, false if not.
1060 : : *
1061 : : * Normally, at each increment we double the size of the array. When doing
1062 : : * that would exceed a limit, we attempt one last, smaller increase (and then
1063 : : * clear the growmemtuples flag so we don't try any more). That allows us to
1064 : : * use memory as fully as permitted; sticking to the pure doubling rule could
1065 : : * result in almost half going unused. Because availMem moves around with
1066 : : * tuple addition/removal, we need some rule to prevent making repeated small
1067 : : * increases in memtupsize, which would just be useless thrashing. The
1068 : : * growmemtuples flag accomplishes that and also prevents useless
1069 : : * recalculations in this function.
1070 : : */
1071 : : static bool
1072 : 3510 : grow_memtuples(Tuplesortstate *state)
1073 : : {
1074 : : int newmemtupsize;
4105 1075 : 3510 : int memtupsize = state->memtupsize;
3937 noah@leadboat.com 1076 : 3510 : int64 memNowUsed = state->allowedMem - state->availMem;
1077 : :
1078 : : /* Forget it if we've already maxed out memtuples, per comment above */
4105 tgl@sss.pgh.pa.us 1079 [ + + ]: 3510 : if (!state->growmemtuples)
1080 : 57 : return false;
1081 : :
1082 : : /* Select new value of memtupsize */
1083 [ + + ]: 3453 : if (memNowUsed <= state->availMem)
1084 : : {
1085 : : /*
1086 : : * We've used no more than half of allowedMem; double our usage,
1087 : : * clamping at INT_MAX tuples.
1088 : : */
3944 noah@leadboat.com 1089 [ + - ]: 3394 : if (memtupsize < INT_MAX / 2)
1090 : 3394 : newmemtupsize = memtupsize * 2;
1091 : : else
1092 : : {
3944 noah@leadboat.com 1093 :UBC 0 : newmemtupsize = INT_MAX;
1094 : 0 : state->growmemtuples = false;
1095 : : }
1096 : : }
1097 : : else
1098 : : {
1099 : : /*
1100 : : * This will be the last increment of memtupsize. Abandon doubling
1101 : : * strategy and instead increase as much as we safely can.
1102 : : *
1103 : : * To stay within allowedMem, we can't increase memtupsize by more
1104 : : * than availMem / sizeof(SortTuple) elements. In practice, we want
1105 : : * to increase it by considerably less, because we need to leave some
1106 : : * space for the tuples to which the new array slots will refer. We
1107 : : * assume the new tuples will be about the same size as the tuples
1108 : : * we've already seen, and thus we can extrapolate from the space
1109 : : * consumption so far to estimate an appropriate new size for the
1110 : : * memtuples array. The optimal value might be higher or lower than
1111 : : * this estimate, but it's hard to know that in advance. We again
1112 : : * clamp at INT_MAX tuples.
1113 : : *
1114 : : * This calculation is safe against enlarging the array so much that
1115 : : * LACKMEM becomes true, because the memory currently used includes
1116 : : * the present array; thus, there would be enough allowedMem for the
1117 : : * new array elements even if no other memory were currently used.
1118 : : *
1119 : : * We do the arithmetic in float8, because otherwise the product of
1120 : : * memtupsize and allowedMem could overflow. Any inaccuracy in the
1121 : : * result should be insignificant; but even if we computed a
1122 : : * completely insane result, the checks below will prevent anything
1123 : : * really bad from happening.
1124 : : */
1125 : : double grow_ratio;
1126 : :
4105 tgl@sss.pgh.pa.us 1127 :CBC 59 : grow_ratio = (double) state->allowedMem / (double) memNowUsed;
3944 noah@leadboat.com 1128 [ + - ]: 59 : if (memtupsize * grow_ratio < INT_MAX)
1129 : 59 : newmemtupsize = (int) (memtupsize * grow_ratio);
1130 : : else
3944 noah@leadboat.com 1131 :UBC 0 : newmemtupsize = INT_MAX;
1132 : :
1133 : : /* We won't make any further enlargement attempts */
4105 tgl@sss.pgh.pa.us 1134 :CBC 59 : state->growmemtuples = false;
1135 : : }
1136 : :
1137 : : /* Must enlarge array by at least one element, else report failure */
1138 [ - + ]: 3453 : if (newmemtupsize <= memtupsize)
4105 tgl@sss.pgh.pa.us 1139 :UBC 0 : goto noalloc;
1140 : :
1141 : : /*
1142 : : * On a 32-bit machine, allowedMem could exceed MaxAllocHugeSize. Clamp
1143 : : * to ensure our request won't be rejected. Note that we can easily
1144 : : * exhaust address space before facing this outcome. (This is presently
1145 : : * impossible due to guc.c's MAX_KILOBYTES limitation on work_mem, but
1146 : : * don't rely on that at this distance.)
1147 : : */
3944 noah@leadboat.com 1148 [ - + ]:CBC 3453 : if ((Size) newmemtupsize >= MaxAllocHugeSize / sizeof(SortTuple))
1149 : : {
3944 noah@leadboat.com 1150 :UBC 0 : newmemtupsize = (int) (MaxAllocHugeSize / sizeof(SortTuple));
4105 tgl@sss.pgh.pa.us 1151 : 0 : state->growmemtuples = false; /* can't grow any more */
1152 : : }
1153 : :
1154 : : /*
1155 : : * We need to be sure that we do not cause LACKMEM to become true, else
1156 : : * the space management algorithm will go nuts. The code above should
1157 : : * never generate a dangerous request, but to be safe, check explicitly
1158 : : * that the array growth fits within availMem. (We could still cause
1159 : : * LACKMEM if the memory chunk overhead associated with the memtuples
1160 : : * array were to increase. That shouldn't happen because we chose the
1161 : : * initial array size large enough to ensure that palloc will be treating
1162 : : * both old and new arrays as separate chunks. But we'll check LACKMEM
1163 : : * explicitly below just in case.)
1164 : : */
3937 noah@leadboat.com 1165 [ - + ]:CBC 3453 : if (state->availMem < (int64) ((newmemtupsize - memtupsize) * sizeof(SortTuple)))
4105 tgl@sss.pgh.pa.us 1166 :UBC 0 : goto noalloc;
1167 : :
1168 : : /* OK, do it */
6622 tgl@sss.pgh.pa.us 1169 :CBC 3453 : FREEMEM(state, GetMemoryChunkSpace(state->memtuples));
4105 1170 : 3453 : state->memtupsize = newmemtupsize;
6622 1171 : 3453 : state->memtuples = (SortTuple *)
3944 noah@leadboat.com 1172 : 3453 : repalloc_huge(state->memtuples,
1173 : 3453 : state->memtupsize * sizeof(SortTuple));
6622 tgl@sss.pgh.pa.us 1174 : 3453 : USEMEM(state, GetMemoryChunkSpace(state->memtuples));
1175 [ - + - - ]: 3453 : if (LACKMEM(state))
3176 tgl@sss.pgh.pa.us 1176 [ # # ]:UBC 0 : elog(ERROR, "unexpected out-of-memory situation in tuplesort");
6622 tgl@sss.pgh.pa.us 1177 :CBC 3453 : return true;
1178 : :
4105 tgl@sss.pgh.pa.us 1179 :UBC 0 : noalloc:
1180 : : /* If for any reason we didn't realloc, shut off future attempts */
1181 : 0 : state->growmemtuples = false;
1182 : 0 : return false;
1183 : : }
1184 : :
1185 : : /*
1186 : : * Shared code for tuple and datum cases.
1187 : : */
1188 : : void
6 drowley@postgresql.o 1189 :GNC 13920021 : tuplesort_puttuple_common(Tuplesortstate *state, SortTuple *tuple,
1190 : : bool useAbbrev, Size tuplen)
1191 : : {
627 akorotkov@postgresql 1192 :CBC 13920021 : MemoryContext oldcontext = MemoryContextSwitchTo(state->base.sortcontext);
1193 : :
2263 rhaas@postgresql.org 1194 [ + + - + ]: 13920021 : Assert(!LEADER(state));
1195 : :
1196 : : /* account for the memory used for this tuple */
6 drowley@postgresql.o 1197 :GNC 13920021 : USEMEM(state, tuplen);
1198 : 13920021 : state->tupleMem += tuplen;
1199 : :
627 akorotkov@postgresql 1200 [ + + ]:CBC 13920021 : if (!useAbbrev)
1201 : : {
1202 : : /*
1203 : : * Leave ordinary Datum representation, or NULL value. If there is a
1204 : : * converter it won't expect NULL values, and cost model is not
1205 : : * required to account for NULL, so in that case we avoid calling
1206 : : * converter and just set datum1 to zeroed representation (to be
1207 : : * consistent, and to support cheap inequality tests for NULL
1208 : : * abbreviated keys).
1209 : : */
1210 : : }
1211 [ + + ]: 2209869 : else if (!consider_abort_common(state))
1212 : : {
1213 : : /* Store abbreviated key representation */
1214 : 2209821 : tuple->datum1 = state->base.sortKeys->abbrev_converter(tuple->datum1,
1215 : : state->base.sortKeys);
1216 : : }
1217 : : else
1218 : : {
1219 : : /*
1220 : : * Set state to be consistent with never trying abbreviation.
1221 : : *
1222 : : * Alter datum1 representation in already-copied tuples, so as to
1223 : : * ensure a consistent representation (current tuple was just
1224 : : * handled). It does not matter if some dumped tuples are already
1225 : : * sorted on tape, since serialized tuples lack abbreviated keys
1226 : : * (TSS_BUILDRUNS state prevents control reaching here in any case).
1227 : : */
1228 : 48 : REMOVEABBREV(state, state->memtuples, state->memtupcount);
1229 : : }
1230 : :
8946 tgl@sss.pgh.pa.us 1231 [ + + + - ]: 13920021 : switch (state->status)
1232 : : {
8207 bruce@momjian.us 1233 : 11533689 : case TSS_INITIAL:
1234 : :
1235 : : /*
1236 : : * Save the tuple into the unsorted array. First, grow the array
1237 : : * as needed. Note that we try to grow the array when there is
1238 : : * still one free slot remaining --- if we fail, there'll still be
1239 : : * room to store the incoming tuple, and then we'll switch to
1240 : : * tape-based operation.
1241 : : */
6622 tgl@sss.pgh.pa.us 1242 [ + + ]: 11533689 : if (state->memtupcount >= state->memtupsize - 1)
1243 : : {
1244 : 3510 : (void) grow_memtuples(state);
1245 [ - + ]: 3510 : Assert(state->memtupcount < state->memtupsize);
1246 : : }
1247 : 11533689 : state->memtuples[state->memtupcount++] = *tuple;
1248 : :
1249 : : /*
1250 : : * Check if it's time to switch over to a bounded heapsort. We do
1251 : : * so if the input tuple count exceeds twice the desired tuple
1252 : : * count (this is a heuristic for where heapsort becomes cheaper
1253 : : * than a quicksort), or if we've just filled workMem and have
1254 : : * enough tuples to meet the bound.
1255 : : *
1256 : : * Note that once we enter TSS_BOUNDED state we will always try to
1257 : : * complete the sort that way. In the worst case, if later input
1258 : : * tuples are larger than earlier ones, this might cause us to
1259 : : * exceed workMem significantly.
1260 : : */
6190 1261 [ + + ]: 11533689 : if (state->bounded &&
1262 [ + + ]: 38714 : (state->memtupcount > state->bound * 2 ||
1263 [ + + - + : 38501 : (state->memtupcount > state->bound && LACKMEM(state))))
- - ]
1264 : : {
1265 : : #ifdef TRACE_SORT
1266 [ - + ]: 213 : if (trace_sort)
6190 tgl@sss.pgh.pa.us 1267 [ # # ]:UBC 0 : elog(LOG, "switching to bounded heapsort at %d tuples: %s",
1268 : : state->memtupcount,
1269 : : pg_rusage_show(&state->ru_start));
1270 : : #endif
6190 tgl@sss.pgh.pa.us 1271 :CBC 213 : make_bounded_heap(state);
627 akorotkov@postgresql 1272 : 213 : MemoryContextSwitchTo(oldcontext);
6190 tgl@sss.pgh.pa.us 1273 : 213 : return;
1274 : : }
1275 : :
1276 : : /*
1277 : : * Done if we still fit in available memory and have array slots.
1278 : : */
6622 1279 [ + + - + : 11533476 : if (state->memtupcount < state->memtupsize && !LACKMEM(state))
- - ]
1280 : : {
627 akorotkov@postgresql 1281 : 11533419 : MemoryContextSwitchTo(oldcontext);
8946 tgl@sss.pgh.pa.us 1282 : 11533419 : return;
1283 : : }
1284 : :
1285 : : /*
1286 : : * Nope; time to switch to tape-based operation.
1287 : : */
2263 rhaas@postgresql.org 1288 : 57 : inittapes(state, true);
1289 : :
1290 : : /*
1291 : : * Dump all tuples.
1292 : : */
8946 tgl@sss.pgh.pa.us 1293 : 57 : dumptuples(state, false);
1294 : 57 : break;
1295 : :
6190 1296 : 1855896 : case TSS_BOUNDED:
1297 : :
1298 : : /*
1299 : : * We don't want to grow the array here, so check whether the new
1300 : : * tuple can be discarded before putting it in. This should be a
1301 : : * good speed optimization, too, since when there are many more
1302 : : * input tuples than the bound, most input tuples can be discarded
1303 : : * with just this one comparison. Note that because we currently
1304 : : * have the sort direction reversed, we must check for <= not >=.
1305 : : */
1306 [ + + ]: 1855896 : if (COMPARETUP(state, tuple, &state->memtuples[0]) <= 0)
1307 : : {
1308 : : /* new tuple <= top of the heap, so we can discard it */
1309 : 1605315 : free_sort_tuple(state, tuple);
4408 rhaas@postgresql.org 1310 [ - + ]: 1605315 : CHECK_FOR_INTERRUPTS();
1311 : : }
1312 : : else
1313 : : {
1314 : : /* discard top of heap, replacing it with the new tuple */
6190 tgl@sss.pgh.pa.us 1315 : 250581 : free_sort_tuple(state, &state->memtuples[0]);
2389 rhaas@postgresql.org 1316 : 250581 : tuplesort_heap_replace_top(state, tuple);
1317 : : }
6190 tgl@sss.pgh.pa.us 1318 : 1855896 : break;
1319 : :
8946 1320 : 530436 : case TSS_BUILDRUNS:
1321 : :
1322 : : /*
1323 : : * Save the tuple into the unsorted array (there must be space)
1324 : : */
2389 rhaas@postgresql.org 1325 : 530436 : state->memtuples[state->memtupcount++] = *tuple;
1326 : :
1327 : : /*
1328 : : * If we are over the memory limit, dump all tuples.
1329 : : */
8946 tgl@sss.pgh.pa.us 1330 : 530436 : dumptuples(state, false);
1331 : 530436 : break;
1332 : :
8946 tgl@sss.pgh.pa.us 1333 :UBC 0 : default:
7569 1334 [ # # ]: 0 : elog(ERROR, "invalid tuplesort state");
1335 : : break;
1336 : : }
627 akorotkov@postgresql 1337 :CBC 2386389 : MemoryContextSwitchTo(oldcontext);
1338 : : }
1339 : :
1340 : : static bool
3373 rhaas@postgresql.org 1341 : 2209869 : consider_abort_common(Tuplesortstate *state)
1342 : : {
627 akorotkov@postgresql 1343 [ - + ]: 2209869 : Assert(state->base.sortKeys[0].abbrev_converter != NULL);
1344 [ - + ]: 2209869 : Assert(state->base.sortKeys[0].abbrev_abort != NULL);
1345 [ - + ]: 2209869 : Assert(state->base.sortKeys[0].abbrev_full_comparator != NULL);
1346 : :
1347 : : /*
1348 : : * Check effectiveness of abbreviation optimization. Consider aborting
1349 : : * when still within memory limit.
1350 : : */
3373 rhaas@postgresql.org 1351 [ + + ]: 2209869 : if (state->status == TSS_INITIAL &&
1352 [ + + ]: 1973639 : state->memtupcount >= state->abbrevNext)
1353 : : {
1354 : 2452 : state->abbrevNext *= 2;
1355 : :
1356 : : /*
1357 : : * Check opclass-supplied abbreviation abort routine. It may indicate
1358 : : * that abbreviation should not proceed.
1359 : : */
627 akorotkov@postgresql 1360 [ + + ]: 2452 : if (!state->base.sortKeys->abbrev_abort(state->memtupcount,
1361 : : state->base.sortKeys))
3373 rhaas@postgresql.org 1362 : 2404 : return false;
1363 : :
1364 : : /*
1365 : : * Finally, restore authoritative comparator, and indicate that
1366 : : * abbreviation is not in play by setting abbrev_converter to NULL
1367 : : */
627 akorotkov@postgresql 1368 : 48 : state->base.sortKeys[0].comparator = state->base.sortKeys[0].abbrev_full_comparator;
1369 : 48 : state->base.sortKeys[0].abbrev_converter = NULL;
1370 : : /* Not strictly necessary, but be tidy */
1371 : 48 : state->base.sortKeys[0].abbrev_abort = NULL;
1372 : 48 : state->base.sortKeys[0].abbrev_full_comparator = NULL;
1373 : :
1374 : : /* Give up - expect original pass-by-value representation */
3373 rhaas@postgresql.org 1375 : 48 : return true;
1376 : : }
1377 : :
1378 : 2207417 : return false;
1379 : : }
1380 : :
1381 : : /*
1382 : : * All tuples have been provided; finish the sort.
1383 : : */
1384 : : void
8946 tgl@sss.pgh.pa.us 1385 : 100131 : tuplesort_performsort(Tuplesortstate *state)
1386 : : {
627 akorotkov@postgresql 1387 : 100131 : MemoryContext oldcontext = MemoryContextSwitchTo(state->base.sortcontext);
1388 : :
1389 : : #ifdef TRACE_SORT
6768 tgl@sss.pgh.pa.us 1390 [ - + ]: 100131 : if (trace_sort)
1991 pg@bowt.ie 1391 [ # # ]:UBC 0 : elog(LOG, "performsort of worker %d starting: %s",
1392 : : state->worker, pg_rusage_show(&state->ru_start));
1393 : : #endif
1394 : :
8946 tgl@sss.pgh.pa.us 1395 [ + + + - ]:CBC 100131 : switch (state->status)
1396 : : {
8207 bruce@momjian.us 1397 : 99861 : case TSS_INITIAL:
1398 : :
1399 : : /*
1400 : : * We were able to accumulate all the tuples within the allowed
1401 : : * amount of memory, or leader to take over worker tapes
1402 : : */
2263 rhaas@postgresql.org 1403 [ + + ]: 99861 : if (SERIAL(state))
1404 : : {
1405 : : /* Just qsort 'em and we're done */
1406 : 99579 : tuplesort_sort_memtuples(state);
1407 : 99537 : state->status = TSS_SORTEDINMEM;
1408 : : }
1409 [ + - + + ]: 282 : else if (WORKER(state))
1410 : : {
1411 : : /*
1412 : : * Parallel workers must still dump out tuples to tape. No
1413 : : * merge is required to produce single output run, though.
1414 : : */
1415 : 210 : inittapes(state, false);
1416 : 210 : dumptuples(state, true);
1417 : 210 : worker_nomergeruns(state);
1418 : 210 : state->status = TSS_SORTEDONTAPE;
1419 : : }
1420 : : else
1421 : : {
1422 : : /*
1423 : : * Leader will take over worker tapes and merge worker runs.
1424 : : * Note that mergeruns sets the correct state->status.
1425 : : */
1426 : 72 : leader_takeover_tapes(state);
1427 : 72 : mergeruns(state);
1428 : : }
8946 tgl@sss.pgh.pa.us 1429 : 99819 : state->current = 0;
1430 : 99819 : state->eof_reached = false;
2263 rhaas@postgresql.org 1431 : 99819 : state->markpos_block = 0L;
8946 tgl@sss.pgh.pa.us 1432 : 99819 : state->markpos_offset = 0;
1433 : 99819 : state->markpos_eof = false;
1434 : 99819 : break;
1435 : :
6190 1436 : 213 : case TSS_BOUNDED:
1437 : :
1438 : : /*
1439 : : * We were able to accumulate all the tuples required for output
1440 : : * in memory, using a heap to eliminate excess tuples. Now we
1441 : : * have to transform the heap to a properly-sorted array. Note
1442 : : * that sort_bounded_heap sets the correct state->status.
1443 : : */
6070 1444 : 213 : sort_bounded_heap(state);
6190 1445 : 213 : state->current = 0;
1446 : 213 : state->eof_reached = false;
1447 : 213 : state->markpos_offset = 0;
1448 : 213 : state->markpos_eof = false;
1449 : 213 : break;
1450 : :
8946 1451 : 57 : case TSS_BUILDRUNS:
1452 : :
1453 : : /*
1454 : : * Finish tape-based sort. First, flush all tuples remaining in
1455 : : * memory out to tape; then merge until we have a single remaining
1456 : : * run (or, if !randomAccess and !WORKER(), one run per tape).
1457 : : * Note that mergeruns sets the correct state->status.
1458 : : */
1459 : 57 : dumptuples(state, true);
1460 : 57 : mergeruns(state);
1461 : 57 : state->eof_reached = false;
1462 : 57 : state->markpos_block = 0L;
1463 : 57 : state->markpos_offset = 0;
1464 : 57 : state->markpos_eof = false;
1465 : 57 : break;
1466 : :
8946 tgl@sss.pgh.pa.us 1467 :UBC 0 : default:
7569 1468 [ # # ]: 0 : elog(ERROR, "invalid tuplesort state");
1469 : : break;
1470 : : }
1471 : :
1472 : : #ifdef TRACE_SORT
6768 tgl@sss.pgh.pa.us 1473 [ - + ]:CBC 100089 : if (trace_sort)
1474 : : {
6612 tgl@sss.pgh.pa.us 1475 [ # # ]:UBC 0 : if (state->status == TSS_FINALMERGE)
1991 pg@bowt.ie 1476 [ # # ]: 0 : elog(LOG, "performsort of worker %d done (except %d-way final merge): %s",
1477 : : state->worker, state->nInputTapes,
1478 : : pg_rusage_show(&state->ru_start));
1479 : : else
1480 [ # # ]: 0 : elog(LOG, "performsort of worker %d done: %s",
1481 : : state->worker, pg_rusage_show(&state->ru_start));
1482 : : }
1483 : : #endif
1484 : :
6622 tgl@sss.pgh.pa.us 1485 :CBC 100089 : MemoryContextSwitchTo(oldcontext);
8946 1486 : 100089 : }
1487 : :
1488 : : /*
1489 : : * Internal routine to fetch the next tuple in either forward or back
1490 : : * direction into *stup. Returns false if no more tuples.
1491 : : * Returned tuple belongs to tuplesort memory context, and must not be freed
1492 : : * by caller. Note that fetched tuple is stored in memory that may be
1493 : : * recycled by any future fetch.
1494 : : */
1495 : : bool
6622 1496 : 12560998 : tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
1497 : : SortTuple *stup)
1498 : : {
1499 : : unsigned int tuplen;
1500 : : size_t nmoved;
1501 : :
2263 rhaas@postgresql.org 1502 [ + + - + ]: 12560998 : Assert(!WORKER(state));
1503 : :
8946 tgl@sss.pgh.pa.us 1504 [ + + + - ]: 12560998 : switch (state->status)
1505 : : {
1506 : 10146716 : case TSS_SORTEDINMEM:
627 akorotkov@postgresql 1507 [ + + - + ]: 10146716 : Assert(forward || state->base.sortopt & TUPLESORT_RANDOMACCESS);
2750 heikki.linnakangas@i 1508 [ - + ]: 10146716 : Assert(!state->slabAllocatorUsed);
8946 tgl@sss.pgh.pa.us 1509 [ + + ]: 10146716 : if (forward)
1510 : : {
1511 [ + + ]: 10146683 : if (state->current < state->memtupcount)
1512 : : {
6622 1513 : 10047806 : *stup = state->memtuples[state->current++];
1514 : 10047806 : return true;
1515 : : }
8946 1516 : 98877 : state->eof_reached = true;
1517 : :
1518 : : /*
1519 : : * Complain if caller tries to retrieve more tuples than
1520 : : * originally asked for in a bounded sort. This is because
1521 : : * returning EOF here might be the wrong thing.
1522 : : */
6190 1523 [ + + - + ]: 98877 : if (state->bounded && state->current >= state->bound)
6190 tgl@sss.pgh.pa.us 1524 [ # # ]:UBC 0 : elog(ERROR, "retrieved too many tuples in a bounded sort");
1525 : :
6622 tgl@sss.pgh.pa.us 1526 :CBC 98877 : return false;
1527 : : }
1528 : : else
1529 : : {
8946 1530 [ - + ]: 33 : if (state->current <= 0)
6622 tgl@sss.pgh.pa.us 1531 :UBC 0 : return false;
1532 : :
1533 : : /*
1534 : : * if all tuples are fetched already then we return last
1535 : : * tuple, else - tuple before last returned.
1536 : : */
8946 tgl@sss.pgh.pa.us 1537 [ + + ]:CBC 33 : if (state->eof_reached)
1538 : 6 : state->eof_reached = false;
1539 : : else
1540 : : {
8768 bruce@momjian.us 1541 : 27 : state->current--; /* last returned tuple */
8946 tgl@sss.pgh.pa.us 1542 [ + + ]: 27 : if (state->current <= 0)
6622 1543 : 3 : return false;
1544 : : }
1545 : 30 : *stup = state->memtuples[state->current - 1];
1546 : 30 : return true;
1547 : : }
1548 : : break;
1549 : :
8946 1550 : 136497 : case TSS_SORTEDONTAPE:
627 akorotkov@postgresql 1551 [ + + - + ]: 136497 : Assert(forward || state->base.sortopt & TUPLESORT_RANDOMACCESS);
2750 heikki.linnakangas@i 1552 [ - + ]: 136497 : Assert(state->slabAllocatorUsed);
1553 : :
1554 : : /*
1555 : : * The slot that held the tuple that we returned in previous
1556 : : * gettuple call can now be reused.
1557 : : */
1558 [ + + ]: 136497 : if (state->lastReturnedTuple)
1559 : : {
1560 [ + - + - ]: 76425 : RELEASE_SLAB_SLOT(state, state->lastReturnedTuple);
1561 : 76425 : state->lastReturnedTuple = NULL;
1562 : : }
1563 : :
8946 tgl@sss.pgh.pa.us 1564 [ + + ]: 136497 : if (forward)
1565 : : {
1566 [ - + ]: 136482 : if (state->eof_reached)
6622 tgl@sss.pgh.pa.us 1567 :UBC 0 : return false;
1568 : :
909 heikki.linnakangas@i 1569 [ + + ]:CBC 136482 : if ((tuplen = getlen(state->result_tape, true)) != 0)
1570 : : {
6622 tgl@sss.pgh.pa.us 1571 : 136470 : READTUP(state, stup, state->result_tape, tuplen);
1572 : :
1573 : : /*
1574 : : * Remember the tuple we return, so that we can recycle
1575 : : * its memory on next call. (This can be NULL, in the
1576 : : * !state->tuples case).
1577 : : */
2750 heikki.linnakangas@i 1578 : 136470 : state->lastReturnedTuple = stup->tuple;
1579 : :
6622 tgl@sss.pgh.pa.us 1580 : 136470 : return true;
1581 : : }
1582 : : else
1583 : : {
8946 1584 : 12 : state->eof_reached = true;
6622 1585 : 12 : return false;
1586 : : }
1587 : : }
1588 : :
1589 : : /*
1590 : : * Backward.
1591 : : *
1592 : : * if all tuples are fetched already then we return last tuple,
1593 : : * else - tuple before last returned.
1594 : : */
8946 1595 [ + + ]: 15 : if (state->eof_reached)
1596 : : {
1597 : : /*
1598 : : * Seek position is pointing just past the zero tuplen at the
1599 : : * end of file; back up to fetch last tuple's ending length
1600 : : * word. If seek fails we must have a completely empty file.
1601 : : */
909 heikki.linnakangas@i 1602 : 6 : nmoved = LogicalTapeBackspace(state->result_tape,
1603 : : 2 * sizeof(unsigned int));
2670 1604 [ - + ]: 6 : if (nmoved == 0)
6622 tgl@sss.pgh.pa.us 1605 :UBC 0 : return false;
2670 heikki.linnakangas@i 1606 [ - + ]:CBC 6 : else if (nmoved != 2 * sizeof(unsigned int))
2670 heikki.linnakangas@i 1607 [ # # ]:UBC 0 : elog(ERROR, "unexpected tape position");
8946 tgl@sss.pgh.pa.us 1608 :CBC 6 : state->eof_reached = false;
1609 : : }
1610 : : else
1611 : : {
1612 : : /*
1613 : : * Back up and fetch previously-returned tuple's ending length
1614 : : * word. If seek fails, assume we are at start of file.
1615 : : */
909 heikki.linnakangas@i 1616 : 9 : nmoved = LogicalTapeBackspace(state->result_tape,
1617 : : sizeof(unsigned int));
2670 1618 [ - + ]: 9 : if (nmoved == 0)
6622 tgl@sss.pgh.pa.us 1619 :UBC 0 : return false;
2670 heikki.linnakangas@i 1620 [ - + ]:CBC 9 : else if (nmoved != sizeof(unsigned int))
2670 heikki.linnakangas@i 1621 [ # # ]:UBC 0 : elog(ERROR, "unexpected tape position");
909 heikki.linnakangas@i 1622 :CBC 9 : tuplen = getlen(state->result_tape, false);
1623 : :
1624 : : /*
1625 : : * Back up to get ending length word of tuple before it.
1626 : : */
1627 : 9 : nmoved = LogicalTapeBackspace(state->result_tape,
1628 : : tuplen + 2 * sizeof(unsigned int));
2670 1629 [ + + ]: 9 : if (nmoved == tuplen + sizeof(unsigned int))
1630 : : {
1631 : : /*
1632 : : * We backed up over the previous tuple, but there was no
1633 : : * ending length word before it. That means that the prev
1634 : : * tuple is the first tuple in the file. It is now the
1635 : : * next to read in forward direction (not obviously right,
1636 : : * but that is what in-memory case does).
1637 : : */
6622 tgl@sss.pgh.pa.us 1638 : 3 : return false;
1639 : : }
2670 heikki.linnakangas@i 1640 [ - + ]: 6 : else if (nmoved != tuplen + 2 * sizeof(unsigned int))
2670 heikki.linnakangas@i 1641 [ # # ]:UBC 0 : elog(ERROR, "bogus tuple length in backward scan");
1642 : : }
1643 : :
909 heikki.linnakangas@i 1644 :CBC 12 : tuplen = getlen(state->result_tape, false);
1645 : :
1646 : : /*
1647 : : * Now we have the length of the prior tuple, back up and read it.
1648 : : * Note: READTUP expects we are positioned after the initial
1649 : : * length word of the tuple, so back up to that point.
1650 : : */
1651 : 12 : nmoved = LogicalTapeBackspace(state->result_tape,
1652 : : tuplen);
2670 1653 [ - + ]: 12 : if (nmoved != tuplen)
7569 tgl@sss.pgh.pa.us 1654 [ # # ]:UBC 0 : elog(ERROR, "bogus tuple length in backward scan");
6622 tgl@sss.pgh.pa.us 1655 :CBC 12 : READTUP(state, stup, state->result_tape, tuplen);
1656 : :
1657 : : /*
1658 : : * Remember the tuple we return, so that we can recycle its memory
1659 : : * on next call. (This can be NULL, in the Datum case).
1660 : : */
2750 heikki.linnakangas@i 1661 : 12 : state->lastReturnedTuple = stup->tuple;
1662 : :
6622 tgl@sss.pgh.pa.us 1663 : 12 : return true;
1664 : :
8946 1665 : 2277785 : case TSS_FINALMERGE:
1666 [ - + ]: 2277785 : Assert(forward);
1667 : : /* We are managing memory ourselves, with the slab allocator. */
2750 heikki.linnakangas@i 1668 [ - + ]: 2277785 : Assert(state->slabAllocatorUsed);
1669 : :
1670 : : /*
1671 : : * The slab slot holding the tuple that we returned in previous
1672 : : * gettuple call can now be reused.
1673 : : */
1674 [ + + ]: 2277785 : if (state->lastReturnedTuple)
1675 : : {
1676 [ + - + - ]: 2247650 : RELEASE_SLAB_SLOT(state, state->lastReturnedTuple);
1677 : 2247650 : state->lastReturnedTuple = NULL;
1678 : : }
1679 : :
1680 : : /*
1681 : : * This code should match the inner loop of mergeonerun().
1682 : : */
8933 tgl@sss.pgh.pa.us 1683 [ + + ]: 2277785 : if (state->memtupcount > 0)
1684 : : {
909 heikki.linnakangas@i 1685 : 2277674 : int srcTapeIndex = state->memtuples[0].srctape;
1686 : 2277674 : LogicalTape *srcTape = state->inputTapes[srcTapeIndex];
1687 : : SortTuple newtup;
1688 : :
2750 1689 : 2277674 : *stup = state->memtuples[0];
1690 : :
1691 : : /*
1692 : : * Remember the tuple we return, so that we can recycle its
1693 : : * memory on next call. (This can be NULL, in the Datum case).
1694 : : */
1695 : 2277674 : state->lastReturnedTuple = stup->tuple;
1696 : :
1697 : : /*
1698 : : * Pull next tuple from tape, and replace the returned tuple
1699 : : * at top of the heap with it.
1700 : : */
1701 [ + + ]: 2277674 : if (!mergereadnext(state, srcTape, &newtup))
1702 : : {
1703 : : /*
1704 : : * If no more data, we've reached end of run on this tape.
1705 : : * Remove the top node from the heap.
1706 : : */
2389 rhaas@postgresql.org 1707 : 173 : tuplesort_heap_delete_top(state);
909 heikki.linnakangas@i 1708 : 173 : state->nInputRuns--;
1709 : :
1710 : : /*
1711 : : * Close the tape. It'd go away at the end of the sort
1712 : : * anyway, but better to release the memory early.
1713 : : */
627 akorotkov@postgresql 1714 : 173 : LogicalTapeClose(srcTape);
1715 : 173 : return true;
1716 : : }
1717 : 2277501 : newtup.srctape = srcTapeIndex;
1718 : 2277501 : tuplesort_heap_replace_top(state, &newtup);
1719 : 2277501 : return true;
1720 : : }
1721 : 111 : return false;
1722 : :
627 akorotkov@postgresql 1723 :UBC 0 : default:
1724 [ # # ]: 0 : elog(ERROR, "invalid tuplesort state");
1725 : : return false; /* keep compiler quiet */
1726 : : }
1727 : : }
1728 : :
1729 : :
1730 : : /*
1731 : : * Advance over N tuples in either forward or back direction,
1732 : : * without returning any data. N==0 is a no-op.
1733 : : * Returns true if successful, false if ran out of tuples.
1734 : : */
1735 : : bool
3765 tgl@sss.pgh.pa.us 1736 :CBC 196 : tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples, bool forward)
1737 : : {
1738 : : MemoryContext oldcontext;
1739 : :
1740 : : /*
1741 : : * We don't actually support backwards skip yet, because no callers need
1742 : : * it. The API is designed to allow for that later, though.
1743 : : */
1744 [ - + ]: 196 : Assert(forward);
1745 [ - + ]: 196 : Assert(ntuples >= 0);
2263 rhaas@postgresql.org 1746 [ - + - - ]: 196 : Assert(!WORKER(state));
1747 : :
3765 tgl@sss.pgh.pa.us 1748 [ + + - ]: 196 : switch (state->status)
1749 : : {
1750 : 184 : case TSS_SORTEDINMEM:
1751 [ + - ]: 184 : if (state->memtupcount - state->current >= ntuples)
1752 : : {
1753 : 184 : state->current += ntuples;
1754 : 184 : return true;
1755 : : }
3765 tgl@sss.pgh.pa.us 1756 :UBC 0 : state->current = state->memtupcount;
1757 : 0 : state->eof_reached = true;
1758 : :
1759 : : /*
1760 : : * Complain if caller tries to retrieve more tuples than
1761 : : * originally asked for in a bounded sort. This is because
1762 : : * returning EOF here might be the wrong thing.
1763 : : */
1764 [ # # # # ]: 0 : if (state->bounded && state->current >= state->bound)
1765 [ # # ]: 0 : elog(ERROR, "retrieved too many tuples in a bounded sort");
1766 : :
1767 : 0 : return false;
1768 : :
3765 tgl@sss.pgh.pa.us 1769 :CBC 12 : case TSS_SORTEDONTAPE:
1770 : : case TSS_FINALMERGE:
1771 : :
1772 : : /*
1773 : : * We could probably optimize these cases better, but for now it's
1774 : : * not worth the trouble.
1775 : : */
627 akorotkov@postgresql 1776 : 12 : oldcontext = MemoryContextSwitchTo(state->base.sortcontext);
3765 tgl@sss.pgh.pa.us 1777 [ + + ]: 120066 : while (ntuples-- > 0)
1778 : : {
1779 : : SortTuple stup;
1780 : :
2680 rhaas@postgresql.org 1781 [ - + ]: 120054 : if (!tuplesort_gettuple_common(state, forward, &stup))
1782 : : {
3764 tgl@sss.pgh.pa.us 1783 :UBC 0 : MemoryContextSwitchTo(oldcontext);
3765 1784 : 0 : return false;
1785 : : }
3765 tgl@sss.pgh.pa.us 1786 [ - + ]:CBC 120054 : CHECK_FOR_INTERRUPTS();
1787 : : }
3764 1788 : 12 : MemoryContextSwitchTo(oldcontext);
3765 1789 : 12 : return true;
1790 : :
3765 tgl@sss.pgh.pa.us 1791 :UBC 0 : default:
1792 [ # # ]: 0 : elog(ERROR, "invalid tuplesort state");
1793 : : return false; /* keep compiler quiet */
1794 : : }
1795 : : }
1796 : :
1797 : : /*
1798 : : * tuplesort_merge_order - report merge order we'll use for given memory
1799 : : * (note: "merge order" just means the number of input tapes in the merge).
1800 : : *
1801 : : * This is exported for use by the planner. allowedMem is in bytes.
1802 : : */
1803 : : int
3937 noah@leadboat.com 1804 :CBC 8511 : tuplesort_merge_order(int64 allowedMem)
1805 : : {
1806 : : int mOrder;
1807 : :
1808 : : /*----------
1809 : : * In the merge phase, we need buffer space for each input and output tape.
1810 : : * Each pass in the balanced merge algorithm reads from M input tapes, and
1811 : : * writes to N output tapes. Each tape consumes TAPE_BUFFER_OVERHEAD bytes
1812 : : * of memory. In addition to that, we want MERGE_BUFFER_SIZE workspace per
1813 : : * input tape.
1814 : : *
1815 : : * totalMem = M * (TAPE_BUFFER_OVERHEAD + MERGE_BUFFER_SIZE) +
1816 : : * N * TAPE_BUFFER_OVERHEAD
1817 : : *
1818 : : * Except for the last and next-to-last merge passes, where there can be
1819 : : * fewer tapes left to process, M = N. We choose M so that we have the
1820 : : * desired amount of memory available for the input buffers
1821 : : * (TAPE_BUFFER_OVERHEAD + MERGE_BUFFER_SIZE), given the total memory
1822 : : * available for the tape buffers (allowedMem).
1823 : : *
1824 : : * Note: you might be thinking we need to account for the memtuples[]
1825 : : * array in this calculation, but we effectively treat that as part of the
1826 : : * MERGE_BUFFER_SIZE workspace.
1827 : : *----------
1828 : : */
909 heikki.linnakangas@i 1829 : 8511 : mOrder = allowedMem /
1830 : : (2 * TAPE_BUFFER_OVERHEAD + MERGE_BUFFER_SIZE);
1831 : :
1832 : : /*
1833 : : * Even in minimum memory, use at least a MINORDER merge. On the other
1834 : : * hand, even when we have lots of memory, do not use more than a MAXORDER
1835 : : * merge. Tapes are pretty cheap, but they're not entirely free. Each
1836 : : * additional tape reduces the amount of memory available to build runs,
1837 : : * which in turn can cause the same sort to need more runs, which makes
1838 : : * merging slower even if it can still be done in a single pass. Also,
1839 : : * high order merges are quite slow due to CPU cache effects; it can be
1840 : : * faster to pay the I/O cost of a multi-pass merge than to perform a
1841 : : * single merge pass across many hundreds of tapes.
1842 : : */
6622 tgl@sss.pgh.pa.us 1843 : 8511 : mOrder = Max(mOrder, MINORDER);
2707 rhaas@postgresql.org 1844 : 8511 : mOrder = Min(mOrder, MAXORDER);
1845 : :
6622 tgl@sss.pgh.pa.us 1846 : 8511 : return mOrder;
1847 : : }
1848 : :
1849 : : /*
1850 : : * Helper function to calculate how much memory to allocate for the read buffer
1851 : : * of each input tape in a merge pass.
1852 : : *
1853 : : * 'avail_mem' is the amount of memory available for the buffers of all the
1854 : : * tapes, both input and output.
1855 : : * 'nInputTapes' and 'nInputRuns' are the number of input tapes and runs.
1856 : : * 'maxOutputTapes' is the max. number of output tapes we should produce.
1857 : : */
1858 : : static int64
909 heikki.linnakangas@i 1859 : 144 : merge_read_buffer_size(int64 avail_mem, int nInputTapes, int nInputRuns,
1860 : : int maxOutputTapes)
1861 : : {
1862 : : int nOutputRuns;
1863 : : int nOutputTapes;
1864 : :
1865 : : /*
1866 : : * How many output tapes will we produce in this pass?
1867 : : *
1868 : : * This is nInputRuns / nInputTapes, rounded up.
1869 : : */
1870 : 144 : nOutputRuns = (nInputRuns + nInputTapes - 1) / nInputTapes;
1871 : :
1872 : 144 : nOutputTapes = Min(nOutputRuns, maxOutputTapes);
1873 : :
1874 : : /*
1875 : : * Each output tape consumes TAPE_BUFFER_OVERHEAD bytes of memory. All
1876 : : * remaining memory is divided evenly between the input tapes.
1877 : : *
1878 : : * This also follows from the formula in tuplesort_merge_order, but here
1879 : : * we derive the input buffer size from the amount of memory available,
1880 : : * and M and N.
1881 : : */
1882 : 144 : return Max((avail_mem - TAPE_BUFFER_OVERHEAD * nOutputTapes) / nInputTapes, 0);
1883 : : }
1884 : :
1885 : : /*
1886 : : * inittapes - initialize for tape sorting.
1887 : : *
1888 : : * This is called only if we have found we won't sort in memory.
1889 : : */
1890 : : static void
2263 rhaas@postgresql.org 1891 : 267 : inittapes(Tuplesortstate *state, bool mergeruns)
1892 : : {
1893 [ + + - + ]: 267 : Assert(!LEADER(state));
1894 : :
1895 [ + + ]: 267 : if (mergeruns)
1896 : : {
1897 : : /* Compute number of input tapes to use when merging */
909 heikki.linnakangas@i 1898 : 57 : state->maxTapes = tuplesort_merge_order(state->allowedMem);
1899 : : }
1900 : : else
1901 : : {
1902 : : /* Workers can sometimes produce single run, output without merge */
2263 rhaas@postgresql.org 1903 [ + - - + ]: 210 : Assert(WORKER(state));
909 heikki.linnakangas@i 1904 : 210 : state->maxTapes = MINORDER;
1905 : : }
1906 : :
1907 : : #ifdef TRACE_SORT
6768 tgl@sss.pgh.pa.us 1908 [ - + ]: 267 : if (trace_sort)
1991 pg@bowt.ie 1909 [ # # ]:UBC 0 : elog(LOG, "worker %d switching to external sort with %d tapes: %s",
1910 : : state->worker, state->maxTapes, pg_rusage_show(&state->ru_start));
1911 : : #endif
1912 : :
1913 : : /* Create the tape set */
909 heikki.linnakangas@i 1914 :CBC 267 : inittapestate(state, state->maxTapes);
2263 rhaas@postgresql.org 1915 : 267 : state->tapeset =
909 heikki.linnakangas@i 1916 : 267 : LogicalTapeSetCreate(false,
2263 rhaas@postgresql.org 1917 [ + + ]: 267 : state->shared ? &state->shared->fileset : NULL,
1918 : : state->worker);
1919 : :
2389 1920 : 267 : state->currentRun = 0;
1921 : :
1922 : : /*
1923 : : * Initialize logical tape arrays.
1924 : : */
909 heikki.linnakangas@i 1925 : 267 : state->inputTapes = NULL;
1926 : 267 : state->nInputTapes = 0;
1927 : 267 : state->nInputRuns = 0;
1928 : :
1929 : 267 : state->outputTapes = palloc0(state->maxTapes * sizeof(LogicalTape *));
1930 : 267 : state->nOutputTapes = 0;
1931 : 267 : state->nOutputRuns = 0;
1932 : :
8946 tgl@sss.pgh.pa.us 1933 : 267 : state->status = TSS_BUILDRUNS;
1934 : :
909 heikki.linnakangas@i 1935 : 267 : selectnewtape(state);
8946 tgl@sss.pgh.pa.us 1936 : 267 : }
1937 : :
1938 : : /*
1939 : : * inittapestate - initialize generic tape management state
1940 : : */
1941 : : static void
2263 rhaas@postgresql.org 1942 : 339 : inittapestate(Tuplesortstate *state, int maxTapes)
1943 : : {
1944 : : int64 tapeSpace;
1945 : :
1946 : : /*
1947 : : * Decrease availMem to reflect the space needed for tape buffers; but
1948 : : * don't decrease it to the point that we have no room for tuples. (That
1949 : : * case is only likely to occur if sorting pass-by-value Datums; in all
1950 : : * other scenarios the memtuples[] array is unlikely to occupy more than
1951 : : * half of allowedMem. In the pass-by-value case it's not important to
1952 : : * account for tuple space, so we don't care if LACKMEM becomes
1953 : : * inaccurate.)
1954 : : */
1955 : 339 : tapeSpace = (int64) maxTapes * TAPE_BUFFER_OVERHEAD;
1956 : :
1957 [ + + ]: 339 : if (tapeSpace + GetMemoryChunkSpace(state->memtuples) < state->allowedMem)
1958 : 291 : USEMEM(state, tapeSpace);
1959 : :
1960 : : /*
1961 : : * Make sure that the temp file(s) underlying the tape set are created in
1962 : : * suitable temp tablespaces. For parallel sorts, this should have been
1963 : : * called already, but it doesn't matter if it is called a second time.
1964 : : */
1965 : 339 : PrepareTempTablespaces();
1966 : 339 : }
1967 : :
1968 : : /*
1969 : : * selectnewtape -- select next tape to output to.
1970 : : *
1971 : : * This is called after finishing a run when we know another run
1972 : : * must be started. This is used both when building the initial
1973 : : * runs, and during merge passes.
1974 : : */
1975 : : static void
8946 tgl@sss.pgh.pa.us 1976 : 801 : selectnewtape(Tuplesortstate *state)
1977 : : {
1978 : : /*
1979 : : * At the beginning of each merge pass, nOutputTapes and nOutputRuns are
1980 : : * both zero. On each call, we create a new output tape to hold the next
1981 : : * run, until maxTapes is reached. After that, we assign new runs to the
1982 : : * existing tapes in a round robin fashion.
1983 : : */
902 heikki.linnakangas@i 1984 [ + + ]: 801 : if (state->nOutputTapes < state->maxTapes)
1985 : : {
1986 : : /* Create a new tape to hold the next run */
909 1987 [ - + ]: 510 : Assert(state->outputTapes[state->nOutputRuns] == NULL);
1988 [ - + ]: 510 : Assert(state->nOutputRuns == state->nOutputTapes);
1989 : 510 : state->destTape = LogicalTapeCreate(state->tapeset);
902 1990 : 510 : state->outputTapes[state->nOutputTapes] = state->destTape;
909 1991 : 510 : state->nOutputTapes++;
1992 : 510 : state->nOutputRuns++;
1993 : : }
1994 : : else
1995 : : {
1996 : : /*
1997 : : * We have reached the max number of tapes. Append to an existing
1998 : : * tape.
1999 : : */
2000 : 291 : state->destTape = state->outputTapes[state->nOutputRuns % state->nOutputTapes];
2001 : 291 : state->nOutputRuns++;
2002 : : }
8946 tgl@sss.pgh.pa.us 2003 : 801 : }
2004 : :
2005 : : /*
2006 : : * Initialize the slab allocation arena, for the given number of slots.
2007 : : */
2008 : : static void
2750 heikki.linnakangas@i 2009 : 129 : init_slab_allocator(Tuplesortstate *state, int numSlots)
2010 : : {
2011 [ + + ]: 129 : if (numSlots > 0)
2012 : : {
2013 : : char *p;
2014 : : int i;
2015 : :
2016 : 123 : state->slabMemoryBegin = palloc(numSlots * SLAB_SLOT_SIZE);
2017 : 123 : state->slabMemoryEnd = state->slabMemoryBegin +
2018 : 123 : numSlots * SLAB_SLOT_SIZE;
2019 : 123 : state->slabFreeHead = (SlabSlot *) state->slabMemoryBegin;
2020 : 123 : USEMEM(state, numSlots * SLAB_SLOT_SIZE);
2021 : :
2022 : 123 : p = state->slabMemoryBegin;
2023 [ + + ]: 476 : for (i = 0; i < numSlots - 1; i++)
2024 : : {
2025 : 353 : ((SlabSlot *) p)->nextfree = (SlabSlot *) (p + SLAB_SLOT_SIZE);
2026 : 353 : p += SLAB_SLOT_SIZE;
2027 : : }
2028 : 123 : ((SlabSlot *) p)->nextfree = NULL;
2029 : : }
2030 : : else
2031 : : {
2032 : 6 : state->slabMemoryBegin = state->slabMemoryEnd = NULL;
2033 : 6 : state->slabFreeHead = NULL;
2034 : : }
2035 : 129 : state->slabAllocatorUsed = true;
2036 : 129 : }
2037 : :
2038 : : /*
2039 : : * mergeruns -- merge all the completed initial runs.
2040 : : *
2041 : : * This implements the Balanced k-Way Merge Algorithm. All input data has
2042 : : * already been written to initial runs on tape (see dumptuples).
2043 : : */
2044 : : static void
8946 tgl@sss.pgh.pa.us 2045 : 129 : mergeruns(Tuplesortstate *state)
2046 : : {
2047 : : int tapenum;
2048 : :
2049 [ - + ]: 129 : Assert(state->status == TSS_BUILDRUNS);
8933 2050 [ - + ]: 129 : Assert(state->memtupcount == 0);
2051 : :
627 akorotkov@postgresql 2052 [ + + + + ]: 129 : if (state->base.sortKeys != NULL && state->base.sortKeys->abbrev_converter != NULL)
2053 : : {
2054 : : /*
2055 : : * If there are multiple runs to be merged, when we go to read back
2056 : : * tuples from disk, abbreviated keys will not have been stored, and
2057 : : * we don't care to regenerate them. Disable abbreviation from this
2058 : : * point on.
2059 : : */
2060 : 15 : state->base.sortKeys->abbrev_converter = NULL;
2061 : 15 : state->base.sortKeys->comparator = state->base.sortKeys->abbrev_full_comparator;
2062 : :
2063 : : /* Not strictly necessary, but be tidy */
2064 : 15 : state->base.sortKeys->abbrev_abort = NULL;
2065 : 15 : state->base.sortKeys->abbrev_full_comparator = NULL;
2066 : : }
2067 : :
2068 : : /*
2069 : : * Reset tuple memory. We've freed all the tuples that we previously
2070 : : * allocated. We will use the slab allocator from now on.
2071 : : */
2072 : 129 : MemoryContextResetOnly(state->base.tuplecontext);
2073 : :
2074 : : /*
2075 : : * We no longer need a large memtuples array. (We will allocate a smaller
2076 : : * one for the heap later.)
2077 : : */
2750 heikki.linnakangas@i 2078 : 129 : FREEMEM(state, GetMemoryChunkSpace(state->memtuples));
2079 : 129 : pfree(state->memtuples);
2080 : 129 : state->memtuples = NULL;
2081 : :
2082 : : /*
2083 : : * Initialize the slab allocator. We need one slab slot per input tape,
2084 : : * for the tuples in the heap, plus one to hold the tuple last returned
2085 : : * from tuplesort_gettuple. (If we're sorting pass-by-val Datums,
2086 : : * however, we don't need to do allocate anything.)
2087 : : *
2088 : : * In a multi-pass merge, we could shrink this allocation for the last
2089 : : * merge pass, if it has fewer tapes than previous passes, but we don't
2090 : : * bother.
2091 : : *
2092 : : * From this point on, we no longer use the USEMEM()/LACKMEM() mechanism
2093 : : * to track memory usage of individual tuples.
2094 : : */
627 akorotkov@postgresql 2095 [ + + ]: 129 : if (state->base.tuples)
909 heikki.linnakangas@i 2096 : 123 : init_slab_allocator(state, state->nOutputTapes + 1);
2097 : : else
2750 2098 : 6 : init_slab_allocator(state, 0);
2099 : :
2100 : : /*
2101 : : * Allocate a new 'memtuples' array, for the heap. It will hold one tuple
2102 : : * from each input tape.
2103 : : *
2104 : : * We could shrink this, too, between passes in a multi-pass merge, but we
2105 : : * don't bother. (The initial input tapes are still in outputTapes. The
2106 : : * number of input tapes will not increase between passes.)
2107 : : */
909 2108 : 129 : state->memtupsize = state->nOutputTapes;
627 akorotkov@postgresql 2109 : 258 : state->memtuples = (SortTuple *) MemoryContextAlloc(state->base.maincontext,
909 heikki.linnakangas@i 2110 : 129 : state->nOutputTapes * sizeof(SortTuple));
2684 2111 : 129 : USEMEM(state, GetMemoryChunkSpace(state->memtuples));
2112 : :
2113 : : /*
2114 : : * Use all the remaining memory we have available for tape buffers among
2115 : : * all the input tapes. At the beginning of each merge pass, we will
2116 : : * divide this memory between the input and output tapes in the pass.
2117 : : */
909 2118 : 129 : state->tape_buffer_mem = state->availMem;
902 2119 : 129 : USEMEM(state, state->tape_buffer_mem);
2120 : : #ifdef TRACE_SORT
2741 2121 [ - + ]: 129 : if (trace_sort)
909 heikki.linnakangas@i 2122 [ # # ]:UBC 0 : elog(LOG, "worker %d using %zu KB of memory for tape buffers",
2123 : : state->worker, state->tape_buffer_mem / 1024);
2124 : : #endif
2125 : :
2126 : : for (;;)
2127 : : {
2128 : : /*
2129 : : * On the first iteration, or if we have read all the runs from the
2130 : : * input tapes in a multi-pass merge, it's time to start a new pass.
2131 : : * Rewind all the output tapes, and make them inputs for the next
2132 : : * pass.
2133 : : */
909 heikki.linnakangas@i 2134 [ + + ]:CBC 198 : if (state->nInputRuns == 0)
2135 : : {
2136 : : int64 input_buffer_size;
2137 : :
2138 : : /* Close the old, emptied, input tapes */
2139 [ + + ]: 144 : if (state->nInputTapes > 0)
2140 : : {
2141 [ + + ]: 105 : for (tapenum = 0; tapenum < state->nInputTapes; tapenum++)
2142 : 90 : LogicalTapeClose(state->inputTapes[tapenum]);
2143 : 15 : pfree(state->inputTapes);
2144 : : }
2145 : :
2146 : : /* Previous pass's outputs become next pass's inputs. */
2147 : 144 : state->inputTapes = state->outputTapes;
2148 : 144 : state->nInputTapes = state->nOutputTapes;
2149 : 144 : state->nInputRuns = state->nOutputRuns;
2150 : :
2151 : : /*
2152 : : * Reset output tape variables. The actual LogicalTapes will be
2153 : : * created as needed, here we only allocate the array to hold
2154 : : * them.
2155 : : */
2156 : 144 : state->outputTapes = palloc0(state->nInputTapes * sizeof(LogicalTape *));
2157 : 144 : state->nOutputTapes = 0;
2158 : 144 : state->nOutputRuns = 0;
2159 : :
2160 : : /*
2161 : : * Redistribute the memory allocated for tape buffers, among the
2162 : : * new input and output tapes.
2163 : : */
2164 : 144 : input_buffer_size = merge_read_buffer_size(state->tape_buffer_mem,
2165 : : state->nInputTapes,
2166 : : state->nInputRuns,
2167 : : state->maxTapes);
2168 : :
2169 : : #ifdef TRACE_SORT
2170 [ - + ]: 144 : if (trace_sort)
909 heikki.linnakangas@i 2171 [ # # ]:UBC 0 : elog(LOG, "starting merge pass of %d input runs on %d tapes, " INT64_FORMAT " KB of memory for each input tape: %s",
2172 : : state->nInputRuns, state->nInputTapes, input_buffer_size / 1024,
2173 : : pg_rusage_show(&state->ru_start));
2174 : : #endif
2175 : :
2176 : : /* Prepare the new input tapes for merge pass. */
909 heikki.linnakangas@i 2177 [ + + ]:CBC 581 : for (tapenum = 0; tapenum < state->nInputTapes; tapenum++)
2178 : 437 : LogicalTapeRewindForRead(state->inputTapes[tapenum], input_buffer_size);
2179 : :
2180 : : /*
2181 : : * If there's just one run left on each input tape, then only one
2182 : : * merge pass remains. If we don't have to produce a materialized
2183 : : * sorted tape, we can stop at this point and do the final merge
2184 : : * on-the-fly.
2185 : : */
627 akorotkov@postgresql 2186 [ + + ]: 144 : if ((state->base.sortopt & TUPLESORT_RANDOMACCESS) == 0
741 drowley@postgresql.o 2187 [ + + ]: 135 : && state->nInputRuns <= state->nInputTapes
909 heikki.linnakangas@i 2188 [ + + + - ]: 120 : && !WORKER(state))
2189 : : {
2190 : : /* Tell logtape.c we won't be writing anymore */
6613 tgl@sss.pgh.pa.us 2191 : 120 : LogicalTapeSetForgetFreeSpace(state->tapeset);
2192 : : /* Initialize for the final merge pass */
2750 heikki.linnakangas@i 2193 : 120 : beginmerge(state);
8946 tgl@sss.pgh.pa.us 2194 : 120 : state->status = TSS_FINALMERGE;
2195 : 120 : return;
2196 : : }
2197 : : }
2198 : :
2199 : : /* Select an output tape */
909 heikki.linnakangas@i 2200 : 78 : selectnewtape(state);
2201 : :
2202 : : /* Merge one run from each input tape. */
2203 : 78 : mergeonerun(state);
2204 : :
2205 : : /*
2206 : : * If the input tapes are empty, and we output only one output run,
2207 : : * we're done. The current output tape contains the final result.
2208 : : */
2209 [ + + + + ]: 78 : if (state->nInputRuns == 0 && state->nOutputRuns <= 1)
2210 : 9 : break;
2211 : : }
2212 : :
2213 : : /*
2214 : : * Done. The result is on a single run on a single tape.
2215 : : */
2216 : 9 : state->result_tape = state->outputTapes[0];
2263 rhaas@postgresql.org 2217 [ - + - - ]: 9 : if (!WORKER(state))
909 heikki.linnakangas@i 2218 : 9 : LogicalTapeFreeze(state->result_tape, NULL);
2219 : : else
2263 rhaas@postgresql.org 2220 :LBC (1) : worker_freeze_result_tape(state);
8946 tgl@sss.pgh.pa.us 2221 :CBC 9 : state->status = TSS_SORTEDONTAPE;
2222 : :
2223 : : /* Close all the now-empty input tapes, to release their read buffers. */
909 heikki.linnakangas@i 2224 [ + + ]: 51 : for (tapenum = 0; tapenum < state->nInputTapes; tapenum++)
2225 : 42 : LogicalTapeClose(state->inputTapes[tapenum]);
2226 : : }
2227 : :
2228 : : /*
2229 : : * Merge one run from each input tape.
2230 : : */
2231 : : static void
8946 tgl@sss.pgh.pa.us 2232 : 78 : mergeonerun(Tuplesortstate *state)
2233 : : {
2234 : : int srcTapeIndex;
2235 : : LogicalTape *srcTape;
2236 : :
2237 : : /*
2238 : : * Start the merge by loading one tuple from each active source tape into
2239 : : * the heap.
2240 : : */
2750 heikki.linnakangas@i 2241 : 78 : beginmerge(state);
2242 : :
591 drowley@postgresql.o 2243 [ - + ]: 78 : Assert(state->slabAllocatorUsed);
2244 : :
2245 : : /*
2246 : : * Execute merge by repeatedly extracting lowest tuple in heap, writing it
2247 : : * out, and replacing it with next tuple from same tape (if there is
2248 : : * another one).
2249 : : */
8933 tgl@sss.pgh.pa.us 2250 [ + + ]: 427794 : while (state->memtupcount > 0)
2251 : : {
2252 : : SortTuple stup;
2253 : :
2254 : : /* write the tuple to destTape */
909 heikki.linnakangas@i 2255 : 427716 : srcTapeIndex = state->memtuples[0].srctape;
2256 : 427716 : srcTape = state->inputTapes[srcTapeIndex];
2257 : 427716 : WRITETUP(state, state->destTape, &state->memtuples[0]);
2258 : :
2259 : : /* recycle the slot of the tuple we just wrote out, for the next read */
2645 tgl@sss.pgh.pa.us 2260 [ + + ]: 427716 : if (state->memtuples[0].tuple)
2261 [ + - + - ]: 367674 : RELEASE_SLAB_SLOT(state, state->memtuples[0].tuple);
2262 : :
2263 : : /*
2264 : : * pull next tuple from the tape, and replace the written-out tuple in
2265 : : * the heap with it.
2266 : : */
2750 heikki.linnakangas@i 2267 [ + + ]: 427716 : if (mergereadnext(state, srcTape, &stup))
2268 : : {
909 2269 : 427293 : stup.srctape = srcTapeIndex;
2389 rhaas@postgresql.org 2270 : 427293 : tuplesort_heap_replace_top(state, &stup);
2271 : : }
2272 : : else
2273 : : {
2274 : 423 : tuplesort_heap_delete_top(state);
909 heikki.linnakangas@i 2275 : 423 : state->nInputRuns--;
2276 : : }
2277 : : }
2278 : :
2279 : : /*
2280 : : * When the heap empties, we're done. Write an end-of-run marker on the
2281 : : * output tape.
2282 : : */
2283 : 78 : markrunend(state->destTape);
8946 tgl@sss.pgh.pa.us 2284 : 78 : }
2285 : :
2286 : : /*
2287 : : * beginmerge - initialize for a merge pass
2288 : : *
2289 : : * Fill the merge heap with the first tuple from each input tape.
2290 : : */
2291 : : static void
2750 heikki.linnakangas@i 2292 : 198 : beginmerge(Tuplesortstate *state)
2293 : : {
2294 : : int activeTapes;
2295 : : int srcTapeIndex;
2296 : :
2297 : : /* Heap should be empty here */
8933 tgl@sss.pgh.pa.us 2298 [ - + ]: 198 : Assert(state->memtupcount == 0);
2299 : :
909 heikki.linnakangas@i 2300 : 198 : activeTapes = Min(state->nInputTapes, state->nInputRuns);
2301 : :
2302 [ + + ]: 926 : for (srcTapeIndex = 0; srcTapeIndex < activeTapes; srcTapeIndex++)
2303 : : {
2304 : : SortTuple tup;
2305 : :
2306 [ + + ]: 728 : if (mergereadnext(state, state->inputTapes[srcTapeIndex], &tup))
2307 : : {
2308 : 620 : tup.srctape = srcTapeIndex;
2389 rhaas@postgresql.org 2309 : 620 : tuplesort_heap_insert(state, &tup);
2310 : : }
2311 : : }
2950 2312 : 198 : }
2313 : :
2314 : : /*
2315 : : * mergereadnext - read next tuple from one merge input tape
2316 : : *
2317 : : * Returns false on EOF.
2318 : : */
2319 : : static bool
909 heikki.linnakangas@i 2320 : 2706118 : mergereadnext(Tuplesortstate *state, LogicalTape *srcTape, SortTuple *stup)
2321 : : {
2322 : : unsigned int tuplen;
2323 : :
2324 : : /* read next tuple, if any */
2325 [ + + ]: 2706118 : if ((tuplen = getlen(srcTape, true)) == 0)
2750 2326 : 704 : return false;
2327 : 2705414 : READTUP(state, stup, srcTape, tuplen);
2328 : :
2329 : 2705414 : return true;
2330 : : }
2331 : :
2332 : : /*
2333 : : * dumptuples - remove tuples from memtuples and write initial run to tape
2334 : : *
2335 : : * When alltuples = true, dump everything currently in memory. (This case is
2336 : : * only used at end of input data.)
2337 : : */
2338 : : static void
8946 tgl@sss.pgh.pa.us 2339 : 530760 : dumptuples(Tuplesortstate *state, bool alltuples)
2340 : : {
2341 : : int memtupwrite;
2342 : : int i;
2343 : :
2344 : : /*
2345 : : * Nothing to do if we still fit in available memory and have array slots,
2346 : : * unless this is the final call during initial run generation.
2347 : : */
2389 rhaas@postgresql.org 2348 [ + + + + : 530760 : if (state->memtupcount < state->memtupsize && !LACKMEM(state) &&
- + ]
2349 [ + + ]: 530304 : !alltuples)
2350 : 530037 : return;
2351 : :
2352 : : /*
2353 : : * Final call might require no sorting, in rare cases where we just so
2354 : : * happen to have previously LACKMEM()'d at the point where exactly all
2355 : : * remaining tuples are loaded into memory, just before input was
2356 : : * exhausted. In general, short final runs are quite possible, but avoid
2357 : : * creating a completely empty run. In a worker, though, we must produce
2358 : : * at least one tape, even if it's empty.
2359 : : */
909 heikki.linnakangas@i 2360 [ + + - + ]: 723 : if (state->memtupcount == 0 && state->currentRun > 0)
909 heikki.linnakangas@i 2361 :UBC 0 : return;
2362 : :
2928 rhaas@postgresql.org 2363 [ - + ]:CBC 723 : Assert(state->status == TSS_BUILDRUNS);
2364 : :
2365 : : /*
2366 : : * It seems unlikely that this limit will ever be exceeded, but take no
2367 : : * chances
2368 : : */
2369 [ - + ]: 723 : if (state->currentRun == INT_MAX)
2928 rhaas@postgresql.org 2370 [ # # ]:UBC 0 : ereport(ERROR,
2371 : : (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
2372 : : errmsg("cannot have more than %d runs for an external sort",
2373 : : INT_MAX)));
2374 : :
909 heikki.linnakangas@i 2375 [ + + ]:CBC 723 : if (state->currentRun > 0)
2376 : 456 : selectnewtape(state);
2377 : :
2928 rhaas@postgresql.org 2378 : 723 : state->currentRun++;
2379 : :
2380 : : #ifdef TRACE_SORT
2381 [ - + ]: 723 : if (trace_sort)
1991 pg@bowt.ie 2382 [ # # ]:UBC 0 : elog(LOG, "worker %d starting quicksort of run %d: %s",
2383 : : state->worker, state->currentRun,
2384 : : pg_rusage_show(&state->ru_start));
2385 : : #endif
2386 : :
2387 : : /*
2388 : : * Sort all tuples accumulated within the allowed amount of memory for
2389 : : * this run using quicksort
2390 : : */
2928 rhaas@postgresql.org 2391 :CBC 723 : tuplesort_sort_memtuples(state);
2392 : :
2393 : : #ifdef TRACE_SORT
2394 [ - + ]: 723 : if (trace_sort)
1991 pg@bowt.ie 2395 [ # # ]:UBC 0 : elog(LOG, "worker %d finished quicksort of run %d: %s",
2396 : : state->worker, state->currentRun,
2397 : : pg_rusage_show(&state->ru_start));
2398 : : #endif
2399 : :
2928 rhaas@postgresql.org 2400 :CBC 723 : memtupwrite = state->memtupcount;
2401 [ + + ]: 2526023 : for (i = 0; i < memtupwrite; i++)
2402 : : {
591 drowley@postgresql.o 2403 : 2525300 : SortTuple *stup = &state->memtuples[i];
2404 : :
2405 : 2525300 : WRITETUP(state, state->destTape, stup);
2406 : : }
2407 : :
2408 : 723 : state->memtupcount = 0;
2409 : :
2410 : : /*
2411 : : * Reset tuple memory. We've freed all of the tuples that we previously
2412 : : * allocated. It's important to avoid fragmentation when there is a stark
2413 : : * change in the sizes of incoming tuples. In bounded sorts,
2414 : : * fragmentation due to AllocSetFree's bucketing by size class might be
2415 : : * particularly bad if this step wasn't taken.
2416 : : */
627 akorotkov@postgresql 2417 : 723 : MemoryContextReset(state->base.tuplecontext);
2418 : :
2419 : : /*
2420 : : * Now update the memory accounting to subtract the memory used by the
2421 : : * tuple.
2422 : : */
6 drowley@postgresql.o 2423 :GNC 723 : FREEMEM(state, state->tupleMem);
2424 : 723 : state->tupleMem = 0;
2425 : :
909 heikki.linnakangas@i 2426 :CBC 723 : markrunend(state->destTape);
2427 : :
2428 : : #ifdef TRACE_SORT
2928 rhaas@postgresql.org 2429 [ - + ]: 723 : if (trace_sort)
1991 pg@bowt.ie 2430 [ # # ]:UBC 0 : elog(LOG, "worker %d finished writing run %d to tape %d: %s",
2431 : : state->worker, state->currentRun, (state->currentRun - 1) % state->nOutputTapes + 1,
2432 : : pg_rusage_show(&state->ru_start));
2433 : : #endif
2434 : : }
2435 : :
2436 : : /*
2437 : : * tuplesort_rescan - rewind and replay the scan
2438 : : */
2439 : : void
8946 tgl@sss.pgh.pa.us 2440 :CBC 29 : tuplesort_rescan(Tuplesortstate *state)
2441 : : {
627 akorotkov@postgresql 2442 : 29 : MemoryContext oldcontext = MemoryContextSwitchTo(state->base.sortcontext);
2443 : :
2444 [ - + ]: 29 : Assert(state->base.sortopt & TUPLESORT_RANDOMACCESS);
2445 : :
8946 tgl@sss.pgh.pa.us 2446 [ + + - ]: 29 : switch (state->status)
2447 : : {
2448 : 26 : case TSS_SORTEDINMEM:
2449 : 26 : state->current = 0;
2450 : 26 : state->eof_reached = false;
2451 : 26 : state->markpos_offset = 0;
2452 : 26 : state->markpos_eof = false;
2453 : 26 : break;
2454 : 3 : case TSS_SORTEDONTAPE:
909 heikki.linnakangas@i 2455 : 3 : LogicalTapeRewindForRead(state->result_tape, 0);
8946 tgl@sss.pgh.pa.us 2456 : 3 : state->eof_reached = false;
2457 : 3 : state->markpos_block = 0L;
2458 : 3 : state->markpos_offset = 0;
2459 : 3 : state->markpos_eof = false;
2460 : 3 : break;
8946 tgl@sss.pgh.pa.us 2461 :UBC 0 : default:
7569 2462 [ # # ]: 0 : elog(ERROR, "invalid tuplesort state");
2463 : : break;
2464 : : }
2465 : :
6622 tgl@sss.pgh.pa.us 2466 :CBC 29 : MemoryContextSwitchTo(oldcontext);
8946 2467 : 29 : }
2468 : :
2469 : : /*
2470 : : * tuplesort_markpos - saves current position in the merged sort file
2471 : : */
2472 : : void
2473 : 289805 : tuplesort_markpos(Tuplesortstate *state)
2474 : : {
627 akorotkov@postgresql 2475 : 289805 : MemoryContext oldcontext = MemoryContextSwitchTo(state->base.sortcontext);
2476 : :
2477 [ - + ]: 289805 : Assert(state->base.sortopt & TUPLESORT_RANDOMACCESS);
2478 : :
8946 tgl@sss.pgh.pa.us 2479 [ + + - ]: 289805 : switch (state->status)
2480 : : {
2481 : 285401 : case TSS_SORTEDINMEM:
2482 : 285401 : state->markpos_offset = state->current;
2483 : 285401 : state->markpos_eof = state->eof_reached;
2484 : 285401 : break;
2485 : 4404 : case TSS_SORTEDONTAPE:
909 heikki.linnakangas@i 2486 : 4404 : LogicalTapeTell(state->result_tape,
2487 : : &state->markpos_block,
2488 : : &state->markpos_offset);
8946 tgl@sss.pgh.pa.us 2489 : 4404 : state->markpos_eof = state->eof_reached;
2490 : 4404 : break;
8946 tgl@sss.pgh.pa.us 2491 :UBC 0 : default:
7569 2492 [ # # ]: 0 : elog(ERROR, "invalid tuplesort state");
2493 : : break;
2494 : : }
2495 : :
6622 tgl@sss.pgh.pa.us 2496 :CBC 289805 : MemoryContextSwitchTo(oldcontext);
8946 2497 : 289805 : }
2498 : :
2499 : : /*
2500 : : * tuplesort_restorepos - restores current position in merged sort file to
2501 : : * last saved position
2502 : : */
2503 : : void
2504 : 15975 : tuplesort_restorepos(Tuplesortstate *state)
2505 : : {
627 akorotkov@postgresql 2506 : 15975 : MemoryContext oldcontext = MemoryContextSwitchTo(state->base.sortcontext);
2507 : :
2508 [ - + ]: 15975 : Assert(state->base.sortopt & TUPLESORT_RANDOMACCESS);
2509 : :
8946 tgl@sss.pgh.pa.us 2510 [ + + - ]: 15975 : switch (state->status)
2511 : : {
2512 : 12879 : case TSS_SORTEDINMEM:
2513 : 12879 : state->current = state->markpos_offset;
2514 : 12879 : state->eof_reached = state->markpos_eof;
2515 : 12879 : break;
2516 : 3096 : case TSS_SORTEDONTAPE:
909 heikki.linnakangas@i 2517 : 3096 : LogicalTapeSeek(state->result_tape,
2518 : : state->markpos_block,
2519 : : state->markpos_offset);
8946 tgl@sss.pgh.pa.us 2520 : 3096 : state->eof_reached = state->markpos_eof;
2521 : 3096 : break;
8946 tgl@sss.pgh.pa.us 2522 :UBC 0 : default:
7569 2523 [ # # ]: 0 : elog(ERROR, "invalid tuplesort state");
2524 : : break;
2525 : : }
2526 : :
6622 tgl@sss.pgh.pa.us 2527 :CBC 15975 : MemoryContextSwitchTo(oldcontext);
8946 2528 : 15975 : }
2529 : :
2530 : : /*
2531 : : * tuplesort_get_stats - extract summary statistics
2532 : : *
2533 : : * This can be called after tuplesort_performsort() finishes to obtain
2534 : : * printable summary information about how the sort was performed.
2535 : : */
2536 : : void
5361 2537 : 192 : tuplesort_get_stats(Tuplesortstate *state,
2538 : : TuplesortInstrumentation *stats)
2539 : : {
2540 : : /*
2541 : : * Note: it might seem we should provide both memory and disk usage for a
2542 : : * disk-based sort. However, the current code doesn't track memory space
2543 : : * accurately once we have begun to return tuples to the caller (since we
2544 : : * don't account for pfree's the caller is expected to do), so we cannot
2545 : : * rely on availMem in a disk sort. This does not seem worth the overhead
2546 : : * to fix. Is it worth creating an API for the memory context code to
2547 : : * tell us how much is actually used in sortcontext?
2548 : : */
1469 tomas.vondra@postgre 2549 : 192 : tuplesort_updatemax(state);
2550 : :
2551 [ - + ]: 192 : if (state->isMaxSpaceDisk)
2420 rhaas@postgresql.org 2552 :UBC 0 : stats->spaceType = SORT_SPACE_TYPE_DISK;
2553 : : else
2420 rhaas@postgresql.org 2554 :CBC 192 : stats->spaceType = SORT_SPACE_TYPE_MEMORY;
1469 tomas.vondra@postgre 2555 : 192 : stats->spaceUsed = (state->maxSpace + 1023) / 1024;
2556 : :
2557 [ + - - - ]: 192 : switch (state->maxSpaceStatus)
2558 : : {
6190 tgl@sss.pgh.pa.us 2559 : 192 : case TSS_SORTEDINMEM:
2560 [ + + ]: 192 : if (state->boundUsed)
2420 rhaas@postgresql.org 2561 : 21 : stats->sortMethod = SORT_TYPE_TOP_N_HEAPSORT;
2562 : : else
2563 : 171 : stats->sortMethod = SORT_TYPE_QUICKSORT;
6190 tgl@sss.pgh.pa.us 2564 : 192 : break;
6190 tgl@sss.pgh.pa.us 2565 :UBC 0 : case TSS_SORTEDONTAPE:
2420 rhaas@postgresql.org 2566 : 0 : stats->sortMethod = SORT_TYPE_EXTERNAL_SORT;
6190 tgl@sss.pgh.pa.us 2567 : 0 : break;
2568 : 0 : case TSS_FINALMERGE:
2420 rhaas@postgresql.org 2569 : 0 : stats->sortMethod = SORT_TYPE_EXTERNAL_MERGE;
6190 tgl@sss.pgh.pa.us 2570 : 0 : break;
2571 : 0 : default:
2420 rhaas@postgresql.org 2572 : 0 : stats->sortMethod = SORT_TYPE_STILL_IN_PROGRESS;
6190 tgl@sss.pgh.pa.us 2573 : 0 : break;
2574 : : }
6190 tgl@sss.pgh.pa.us 2575 :CBC 192 : }
2576 : :
2577 : : /*
2578 : : * Convert TuplesortMethod to a string.
2579 : : */
2580 : : const char *
2420 rhaas@postgresql.org 2581 : 141 : tuplesort_method_name(TuplesortMethod m)
2582 : : {
2583 [ - + + - : 141 : switch (m)
- - ]
2584 : : {
2420 rhaas@postgresql.org 2585 :UBC 0 : case SORT_TYPE_STILL_IN_PROGRESS:
2586 : 0 : return "still in progress";
2420 rhaas@postgresql.org 2587 :CBC 21 : case SORT_TYPE_TOP_N_HEAPSORT:
2588 : 21 : return "top-N heapsort";
2589 : 120 : case SORT_TYPE_QUICKSORT:
2590 : 120 : return "quicksort";
2420 rhaas@postgresql.org 2591 :UBC 0 : case SORT_TYPE_EXTERNAL_SORT:
2592 : 0 : return "external sort";
2593 : 0 : case SORT_TYPE_EXTERNAL_MERGE:
2594 : 0 : return "external merge";
2595 : : }
2596 : :
2597 : 0 : return "unknown";
2598 : : }
2599 : :
2600 : : /*
2601 : : * Convert TuplesortSpaceType to a string.
2602 : : */
2603 : : const char *
2420 rhaas@postgresql.org 2604 :CBC 123 : tuplesort_space_type_name(TuplesortSpaceType t)
2605 : : {
2606 [ + - - + ]: 123 : Assert(t == SORT_SPACE_TYPE_DISK || t == SORT_SPACE_TYPE_MEMORY);
2607 [ - + ]: 123 : return t == SORT_SPACE_TYPE_DISK ? "Disk" : "Memory";
2608 : : }
2609 : :
2610 : :
2611 : : /*
2612 : : * Heap manipulation routines, per Knuth's Algorithm 5.2.3H.
2613 : : */
2614 : :
2615 : : /*
2616 : : * Convert the existing unordered array of SortTuples to a bounded heap,
2617 : : * discarding all but the smallest "state->bound" tuples.
2618 : : *
2619 : : * When working with a bounded heap, we want to keep the largest entry
2620 : : * at the root (array entry zero), instead of the smallest as in the normal
2621 : : * sort case. This allows us to discard the largest entry cheaply.
2622 : : * Therefore, we temporarily reverse the sort direction.
2623 : : */
2624 : : static void
6190 tgl@sss.pgh.pa.us 2625 : 213 : make_bounded_heap(Tuplesortstate *state)
2626 : : {
5995 bruce@momjian.us 2627 : 213 : int tupcount = state->memtupcount;
2628 : : int i;
2629 : :
6190 tgl@sss.pgh.pa.us 2630 [ - + ]: 213 : Assert(state->status == TSS_INITIAL);
2631 [ - + ]: 213 : Assert(state->bounded);
2632 [ - + ]: 213 : Assert(tupcount >= state->bound);
2263 rhaas@postgresql.org 2633 [ - + ]: 213 : Assert(SERIAL(state));
2634 : :
2635 : : /* Reverse sort direction so largest entry will be at root */
3446 2636 : 213 : reversedirection(state);
2637 : :
6190 tgl@sss.pgh.pa.us 2638 : 213 : state->memtupcount = 0; /* make the heap empty */
5995 bruce@momjian.us 2639 [ + + ]: 23068 : for (i = 0; i < tupcount; i++)
2640 : : {
2772 heikki.linnakangas@i 2641 [ + + ]: 22855 : if (state->memtupcount < state->bound)
2642 : : {
2643 : : /* Insert next tuple into heap */
2644 : : /* Must copy source tuple to avoid possible overwrite */
5995 bruce@momjian.us 2645 : 11321 : SortTuple stup = state->memtuples[i];
2646 : :
2389 rhaas@postgresql.org 2647 : 11321 : tuplesort_heap_insert(state, &stup);
2648 : : }
2649 : : else
2650 : : {
2651 : : /*
2652 : : * The heap is full. Replace the largest entry with the new
2653 : : * tuple, or just discard it, if it's larger than anything already
2654 : : * in the heap.
2655 : : */
2772 heikki.linnakangas@i 2656 [ + + ]: 11534 : if (COMPARETUP(state, &state->memtuples[i], &state->memtuples[0]) <= 0)
2657 : : {
2658 : 5523 : free_sort_tuple(state, &state->memtuples[i]);
2659 [ - + ]: 5523 : CHECK_FOR_INTERRUPTS();
2660 : : }
2661 : : else
2389 rhaas@postgresql.org 2662 : 6011 : tuplesort_heap_replace_top(state, &state->memtuples[i]);
2663 : : }
2664 : : }
2665 : :
6190 tgl@sss.pgh.pa.us 2666 [ - + ]: 213 : Assert(state->memtupcount == state->bound);
2667 : 213 : state->status = TSS_BOUNDED;
2668 : 213 : }
2669 : :
2670 : : /*
2671 : : * Convert the bounded heap to a properly-sorted array
2672 : : */
2673 : : static void
2674 : 213 : sort_bounded_heap(Tuplesortstate *state)
2675 : : {
5995 bruce@momjian.us 2676 : 213 : int tupcount = state->memtupcount;
2677 : :
6190 tgl@sss.pgh.pa.us 2678 [ - + ]: 213 : Assert(state->status == TSS_BOUNDED);
2679 [ - + ]: 213 : Assert(state->bounded);
2680 [ - + ]: 213 : Assert(tupcount == state->bound);
2263 rhaas@postgresql.org 2681 [ - + ]: 213 : Assert(SERIAL(state));
2682 : :
2683 : : /*
2684 : : * We can unheapify in place because each delete-top call will remove the
2685 : : * largest entry, which we can promptly store in the newly freed slot at
2686 : : * the end. Once we're down to a single-entry heap, we're done.
2687 : : */
6190 tgl@sss.pgh.pa.us 2688 [ + + ]: 11321 : while (state->memtupcount > 1)
2689 : : {
5995 bruce@momjian.us 2690 : 11108 : SortTuple stup = state->memtuples[0];
2691 : :
2692 : : /* this sifts-up the next-largest entry and decreases memtupcount */
2389 rhaas@postgresql.org 2693 : 11108 : tuplesort_heap_delete_top(state);
6190 tgl@sss.pgh.pa.us 2694 : 11108 : state->memtuples[state->memtupcount] = stup;
2695 : : }
2696 : 213 : state->memtupcount = tupcount;
2697 : :
2698 : : /*
2699 : : * Reverse sort direction back to the original state. This is not
2700 : : * actually necessary but seems like a good idea for tidiness.
2701 : : */
3446 rhaas@postgresql.org 2702 : 213 : reversedirection(state);
2703 : :
6190 tgl@sss.pgh.pa.us 2704 : 213 : state->status = TSS_SORTEDINMEM;
2705 : 213 : state->boundUsed = true;
2706 : 213 : }
2707 : :
2708 : : /*
2709 : : * Sort all memtuples using specialized qsort() routines.
2710 : : *
2711 : : * Quicksort is used for small in-memory sorts, and external sort runs.
2712 : : */
2713 : : static void
2928 rhaas@postgresql.org 2714 : 100306 : tuplesort_sort_memtuples(Tuplesortstate *state)
2715 : : {
2263 2716 [ + + - + ]: 100306 : Assert(!LEADER(state));
2717 : :
2928 2718 [ + + ]: 100306 : if (state->memtupcount > 1)
2719 : : {
2720 : : /*
2721 : : * Do we have the leading column's value or abbreviation in datum1,
2722 : : * and is there a specialization for its comparator?
2723 : : */
627 akorotkov@postgresql 2724 [ + + + + ]: 30921 : if (state->base.haveDatum1 && state->base.sortKeys)
2725 : : {
2726 [ + + ]: 30901 : if (state->base.sortKeys[0].comparator == ssup_datum_unsigned_cmp)
2727 : : {
741 tmunro@postgresql.or 2728 : 1466 : qsort_tuple_unsigned(state->memtuples,
2729 : 1466 : state->memtupcount,
2730 : : state);
2731 : 1458 : return;
2732 : : }
2733 : : #if SIZEOF_DATUM >= 8
627 akorotkov@postgresql 2734 [ + + ]: 29435 : else if (state->base.sortKeys[0].comparator == ssup_datum_signed_cmp)
2735 : : {
741 tmunro@postgresql.or 2736 : 527 : qsort_tuple_signed(state->memtuples,
2737 : 527 : state->memtupcount,
2738 : : state);
2739 : 527 : return;
2740 : : }
2741 : : #endif
627 akorotkov@postgresql 2742 [ + + ]: 28908 : else if (state->base.sortKeys[0].comparator == ssup_datum_int32_cmp)
2743 : : {
741 tmunro@postgresql.or 2744 : 19199 : qsort_tuple_int32(state->memtuples,
2745 : 19199 : state->memtupcount,
2746 : : state);
2747 : 19169 : return;
2748 : : }
2749 : : }
2750 : :
2751 : : /* Can we use the single-key sort function? */
627 akorotkov@postgresql 2752 [ + + ]: 9729 : if (state->base.onlyKey != NULL)
2753 : : {
2928 rhaas@postgresql.org 2754 : 4116 : qsort_ssup(state->memtuples, state->memtupcount,
627 akorotkov@postgresql 2755 : 4116 : state->base.onlyKey);
2756 : : }
2757 : : else
2758 : : {
2928 rhaas@postgresql.org 2759 : 5613 : qsort_tuple(state->memtuples,
2760 : 5613 : state->memtupcount,
2761 : : state->base.comparetup,
2762 : : state);
2763 : : }
2764 : : }
2765 : : }
2766 : :
2767 : : /*
2768 : : * Insert a new tuple into an empty or existing heap, maintaining the
2769 : : * heap invariant. Caller is responsible for ensuring there's room.
2770 : : *
2771 : : * Note: For some callers, tuple points to a memtuples[] entry above the
2772 : : * end of the heap. This is safe as long as it's not immediately adjacent
2773 : : * to the end of the heap (ie, in the [memtupcount] array entry) --- if it
2774 : : * is, it might get overwritten before being moved into the heap!
2775 : : */
2776 : : static void
2389 2777 : 11941 : tuplesort_heap_insert(Tuplesortstate *state, SortTuple *tuple)
2778 : : {
2779 : : SortTuple *memtuples;
2780 : : int j;
2781 : :
8933 tgl@sss.pgh.pa.us 2782 : 11941 : memtuples = state->memtuples;
6622 2783 [ - + ]: 11941 : Assert(state->memtupcount < state->memtupsize);
2784 : :
4408 rhaas@postgresql.org 2785 [ + + ]: 11941 : CHECK_FOR_INTERRUPTS();
2786 : :
2787 : : /*
2788 : : * Sift-up the new entry, per Knuth 5.2.3 exercise 16. Note that Knuth is
2789 : : * using 1-based array indexes, not 0-based.
2790 : : */
8933 tgl@sss.pgh.pa.us 2791 : 11941 : j = state->memtupcount++;
2792 [ + + ]: 33245 : while (j > 0)
2793 : : {
8768 bruce@momjian.us 2794 : 29834 : int i = (j - 1) >> 1;
2795 : :
2389 rhaas@postgresql.org 2796 [ + + ]: 29834 : if (COMPARETUP(state, tuple, &memtuples[i]) >= 0)
8946 tgl@sss.pgh.pa.us 2797 : 8530 : break;
8933 2798 : 21304 : memtuples[j] = memtuples[i];
8946 2799 : 21304 : j = i;
2800 : : }
6622 2801 : 11941 : memtuples[j] = *tuple;
8946 2802 : 11941 : }
2803 : :
2804 : : /*
2805 : : * Remove the tuple at state->memtuples[0] from the heap. Decrement
2806 : : * memtupcount, and sift up to maintain the heap invariant.
2807 : : *
2808 : : * The caller has already free'd the tuple the top node points to,
2809 : : * if necessary.
2810 : : */
2811 : : static void
2389 rhaas@postgresql.org 2812 : 11704 : tuplesort_heap_delete_top(Tuplesortstate *state)
2813 : : {
6622 tgl@sss.pgh.pa.us 2814 : 11704 : SortTuple *memtuples = state->memtuples;
2815 : : SortTuple *tuple;
2816 : :
8933 2817 [ + + ]: 11704 : if (--state->memtupcount <= 0)
8946 2818 : 138 : return;
2819 : :
2820 : : /*
2821 : : * Remove the last tuple in the heap, and re-insert it, by replacing the
2822 : : * current top node with it.
2823 : : */
2772 heikki.linnakangas@i 2824 : 11566 : tuple = &memtuples[state->memtupcount];
2389 rhaas@postgresql.org 2825 : 11566 : tuplesort_heap_replace_top(state, tuple);
2826 : : }
2827 : :
2828 : : /*
2829 : : * Replace the tuple at state->memtuples[0] with a new tuple. Sift up to
2830 : : * maintain the heap invariant.
2831 : : *
2832 : : * This corresponds to Knuth's "sift-up" algorithm (Algorithm 5.2.3H,
2833 : : * Heapsort, steps H3-H8).
2834 : : */
2835 : : static void
2836 : 2972952 : tuplesort_heap_replace_top(Tuplesortstate *state, SortTuple *tuple)
2837 : : {
2772 heikki.linnakangas@i 2838 : 2972952 : SortTuple *memtuples = state->memtuples;
2839 : : unsigned int i,
2840 : : n;
2841 : :
2842 [ - + ]: 2972952 : Assert(state->memtupcount >= 1);
2843 : :
4408 rhaas@postgresql.org 2844 [ - + ]: 2972952 : CHECK_FOR_INTERRUPTS();
2845 : :
2846 : : /*
2847 : : * state->memtupcount is "int", but we use "unsigned int" for i, j, n.
2848 : : * This prevents overflow in the "2 * i + 1" calculation, since at the top
2849 : : * of the loop we must have i < n <= INT_MAX <= UINT_MAX/2.
2850 : : */
8933 tgl@sss.pgh.pa.us 2851 : 2972952 : n = state->memtupcount;
8946 2852 : 2972952 : i = 0; /* i is where the "hole" is */
2853 : : for (;;)
8933 2854 : 927426 : {
2468 2855 : 3900378 : unsigned int j = 2 * i + 1;
2856 : :
8946 2857 [ + + ]: 3900378 : if (j >= n)
2858 : 573566 : break;
8768 bruce@momjian.us 2859 [ + + + + ]: 4561498 : if (j + 1 < n &&
2389 rhaas@postgresql.org 2860 : 1234686 : COMPARETUP(state, &memtuples[j], &memtuples[j + 1]) > 0)
8946 tgl@sss.pgh.pa.us 2861 : 494226 : j++;
2389 rhaas@postgresql.org 2862 [ + + ]: 3326812 : if (COMPARETUP(state, tuple, &memtuples[j]) <= 0)
8946 tgl@sss.pgh.pa.us 2863 : 2399386 : break;
8933 2864 : 927426 : memtuples[i] = memtuples[j];
8946 2865 : 927426 : i = j;
2866 : : }
6622 2867 : 2972952 : memtuples[i] = *tuple;
8946 2868 : 2972952 : }
2869 : :
2870 : : /*
2871 : : * Function to reverse the sort direction from its current state
2872 : : *
2873 : : * It is not safe to call this when performing hash tuplesorts
2874 : : */
2875 : : static void
3446 rhaas@postgresql.org 2876 : 426 : reversedirection(Tuplesortstate *state)
2877 : : {
627 akorotkov@postgresql 2878 : 426 : SortSupport sortKey = state->base.sortKeys;
2879 : : int nkey;
2880 : :
2881 [ + + ]: 1032 : for (nkey = 0; nkey < state->base.nKeys; nkey++, sortKey++)
2882 : : {
3446 rhaas@postgresql.org 2883 : 606 : sortKey->ssup_reverse = !sortKey->ssup_reverse;
2884 : 606 : sortKey->ssup_nulls_first = !sortKey->ssup_nulls_first;
2885 : : }
2886 : 426 : }
2887 : :
2888 : :
2889 : : /*
2890 : : * Tape interface routines
2891 : : */
2892 : :
2893 : : static unsigned int
909 heikki.linnakangas@i 2894 : 2842621 : getlen(LogicalTape *tape, bool eofOK)
2895 : : {
2896 : : unsigned int len;
2897 : :
2898 [ - + ]: 2842621 : if (LogicalTapeRead(tape,
2899 : : &len, sizeof(len)) != sizeof(len))
7569 tgl@sss.pgh.pa.us 2900 [ # # ]:UBC 0 : elog(ERROR, "unexpected end of tape");
8946 tgl@sss.pgh.pa.us 2901 [ + + - + ]:CBC 2842621 : if (len == 0 && !eofOK)
7569 tgl@sss.pgh.pa.us 2902 [ # # ]:UBC 0 : elog(ERROR, "unexpected end of data");
8946 tgl@sss.pgh.pa.us 2903 :CBC 2842621 : return len;
2904 : : }
2905 : :
2906 : : static void
909 heikki.linnakangas@i 2907 : 801 : markrunend(LogicalTape *tape)
2908 : : {
8768 bruce@momjian.us 2909 : 801 : unsigned int len = 0;
2910 : :
471 peter@eisentraut.org 2911 : 801 : LogicalTapeWrite(tape, &len, sizeof(len));
8946 tgl@sss.pgh.pa.us 2912 : 801 : }
2913 : :
2914 : : /*
2915 : : * Get memory for tuple from within READTUP() routine.
2916 : : *
2917 : : * We use next free slot from the slab allocator, or palloc() if the tuple
2918 : : * is too large for that.
2919 : : */
2920 : : void *
627 akorotkov@postgresql 2921 : 2691782 : tuplesort_readtup_alloc(Tuplesortstate *state, Size tuplen)
2922 : : {
2923 : : SlabSlot *buf;
2924 : :
2925 : : /*
2926 : : * We pre-allocate enough slots in the slab arena that we should never run
2927 : : * out.
2928 : : */
2750 heikki.linnakangas@i 2929 [ - + ]: 2691782 : Assert(state->slabFreeHead);
2930 : :
2931 [ + - - + ]: 2691782 : if (tuplen > SLAB_SLOT_SIZE || !state->slabFreeHead)
627 akorotkov@postgresql 2932 :UBC 0 : return MemoryContextAlloc(state->base.sortcontext, tuplen);
2933 : : else
2934 : : {
2750 heikki.linnakangas@i 2935 :CBC 2691782 : buf = state->slabFreeHead;
2936 : : /* Reuse this slot */
2937 : 2691782 : state->slabFreeHead = buf->nextfree;
2938 : :
2939 : 2691782 : return buf;
2940 : : }
2941 : : }
2942 : :
2943 : :
2944 : : /*
2945 : : * Parallel sort routines
2946 : : */
2947 : :
2948 : : /*
2949 : : * tuplesort_estimate_shared - estimate required shared memory allocation
2950 : : *
2951 : : * nWorkers is an estimate of the number of workers (it's the number that
2952 : : * will be requested).
2953 : : */
2954 : : Size
2263 rhaas@postgresql.org 2955 : 73 : tuplesort_estimate_shared(int nWorkers)
2956 : : {
2957 : : Size tapesSize;
2958 : :
2959 [ - + ]: 73 : Assert(nWorkers > 0);
2960 : :
2961 : : /* Make sure that BufFile shared state is MAXALIGN'd */
2962 : 73 : tapesSize = mul_size(sizeof(TapeShare), nWorkers);
2963 : 73 : tapesSize = MAXALIGN(add_size(tapesSize, offsetof(Sharedsort, tapes)));
2964 : :
2965 : 73 : return tapesSize;
2966 : : }
2967 : :
2968 : : /*
2969 : : * tuplesort_initialize_shared - initialize shared tuplesort state
2970 : : *
2971 : : * Must be called from leader process before workers are launched, to
2972 : : * establish state needed up-front for worker tuplesortstates. nWorkers
2973 : : * should match the argument passed to tuplesort_estimate_shared().
2974 : : */
2975 : : void
2976 : 105 : tuplesort_initialize_shared(Sharedsort *shared, int nWorkers, dsm_segment *seg)
2977 : : {
2978 : : int i;
2979 : :
2980 [ - + ]: 105 : Assert(nWorkers > 0);
2981 : :
2982 : 105 : SpinLockInit(&shared->mutex);
2983 : 105 : shared->currentWorker = 0;
2984 : 105 : shared->workersFinished = 0;
2985 : 105 : SharedFileSetInit(&shared->fileset, seg);
2986 : 105 : shared->nTapes = nWorkers;
2987 [ + + ]: 319 : for (i = 0; i < nWorkers; i++)
2988 : : {
2989 : 214 : shared->tapes[i].firstblocknumber = 0L;
2990 : : }
2991 : 105 : }
2992 : :
2993 : : /*
2994 : : * tuplesort_attach_shared - attach to shared tuplesort state
2995 : : *
2996 : : * Must be called by all worker processes.
2997 : : */
2998 : : void
2999 : 106 : tuplesort_attach_shared(Sharedsort *shared, dsm_segment *seg)
3000 : : {
3001 : : /* Attach to SharedFileSet */
3002 : 106 : SharedFileSetAttach(&shared->fileset, seg);
3003 : 106 : }
3004 : :
3005 : : /*
3006 : : * worker_get_identifier - Assign and return ordinal identifier for worker
3007 : : *
3008 : : * The order in which these are assigned is not well defined, and should not
3009 : : * matter; worker numbers across parallel sort participants need only be
3010 : : * distinct and gapless. logtape.c requires this.
3011 : : *
3012 : : * Note that the identifiers assigned from here have no relation to
3013 : : * ParallelWorkerNumber number, to avoid making any assumption about
3014 : : * caller's requirements. However, we do follow the ParallelWorkerNumber
3015 : : * convention of representing a non-worker with worker number -1. This
3016 : : * includes the leader, as well as serial Tuplesort processes.
3017 : : */
3018 : : static int
3019 : 210 : worker_get_identifier(Tuplesortstate *state)
3020 : : {
3021 : 210 : Sharedsort *shared = state->shared;
3022 : : int worker;
3023 : :
3024 [ + - - + ]: 210 : Assert(WORKER(state));
3025 : :
3026 [ - + ]: 210 : SpinLockAcquire(&shared->mutex);
3027 : 210 : worker = shared->currentWorker++;
3028 : 210 : SpinLockRelease(&shared->mutex);
3029 : :
3030 : 210 : return worker;
3031 : : }
3032 : :
3033 : : /*
3034 : : * worker_freeze_result_tape - freeze worker's result tape for leader
3035 : : *
3036 : : * This is called by workers just after the result tape has been determined,
3037 : : * instead of calling LogicalTapeFreeze() directly. They do so because
3038 : : * workers require a few additional steps over similar serial
3039 : : * TSS_SORTEDONTAPE external sort cases, which also happen here. The extra
3040 : : * steps are around freeing now unneeded resources, and representing to
3041 : : * leader that worker's input run is available for its merge.
3042 : : *
3043 : : * There should only be one final output run for each worker, which consists
3044 : : * of all tuples that were originally input into worker.
3045 : : */
3046 : : static void
3047 : 210 : worker_freeze_result_tape(Tuplesortstate *state)
3048 : : {
3049 : 210 : Sharedsort *shared = state->shared;
3050 : : TapeShare output;
3051 : :
3052 [ + - - + ]: 210 : Assert(WORKER(state));
909 heikki.linnakangas@i 3053 [ - + ]: 210 : Assert(state->result_tape != NULL);
2263 rhaas@postgresql.org 3054 [ - + ]: 210 : Assert(state->memtupcount == 0);
3055 : :
3056 : : /*
3057 : : * Free most remaining memory, in case caller is sensitive to our holding
3058 : : * on to it. memtuples may not be a tiny merge heap at this point.
3059 : : */
3060 : 210 : pfree(state->memtuples);
3061 : : /* Be tidy */
3062 : 210 : state->memtuples = NULL;
3063 : 210 : state->memtupsize = 0;
3064 : :
3065 : : /*
3066 : : * Parallel worker requires result tape metadata, which is to be stored in
3067 : : * shared memory for leader
3068 : : */
909 heikki.linnakangas@i 3069 : 210 : LogicalTapeFreeze(state->result_tape, &output);
3070 : :
3071 : : /* Store properties of output tape, and update finished worker count */
2263 rhaas@postgresql.org 3072 [ - + ]: 210 : SpinLockAcquire(&shared->mutex);
3073 : 210 : shared->tapes[state->worker] = output;
3074 : 210 : shared->workersFinished++;
3075 : 210 : SpinLockRelease(&shared->mutex);
3076 : 210 : }
3077 : :
3078 : : /*
3079 : : * worker_nomergeruns - dump memtuples in worker, without merging
3080 : : *
3081 : : * This called as an alternative to mergeruns() with a worker when no
3082 : : * merging is required.
3083 : : */
3084 : : static void
3085 : 210 : worker_nomergeruns(Tuplesortstate *state)
3086 : : {
3087 [ + - - + ]: 210 : Assert(WORKER(state));
909 heikki.linnakangas@i 3088 [ - + ]: 210 : Assert(state->result_tape == NULL);
3089 [ - + ]: 210 : Assert(state->nOutputRuns == 1);
3090 : :
3091 : 210 : state->result_tape = state->destTape;
2263 rhaas@postgresql.org 3092 : 210 : worker_freeze_result_tape(state);
3093 : 210 : }
3094 : :
3095 : : /*
3096 : : * leader_takeover_tapes - create tapeset for leader from worker tapes
3097 : : *
3098 : : * So far, leader Tuplesortstate has performed no actual sorting. By now, all
3099 : : * sorting has occurred in workers, all of which must have already returned
3100 : : * from tuplesort_performsort().
3101 : : *
3102 : : * When this returns, leader process is left in a state that is virtually
3103 : : * indistinguishable from it having generated runs as a serial external sort
3104 : : * might have.
3105 : : */
3106 : : static void
3107 : 72 : leader_takeover_tapes(Tuplesortstate *state)
3108 : : {
3109 : 72 : Sharedsort *shared = state->shared;
3110 : 72 : int nParticipants = state->nParticipants;
3111 : : int workersFinished;
3112 : : int j;
3113 : :
3114 [ + - - + ]: 72 : Assert(LEADER(state));
3115 [ - + ]: 72 : Assert(nParticipants >= 1);
3116 : :
3117 [ - + ]: 72 : SpinLockAcquire(&shared->mutex);
3118 : 72 : workersFinished = shared->workersFinished;
3119 : 72 : SpinLockRelease(&shared->mutex);
3120 : :
3121 [ - + ]: 72 : if (nParticipants != workersFinished)
2263 rhaas@postgresql.org 3122 [ # # ]:UBC 0 : elog(ERROR, "cannot take over tapes before all workers finish");
3123 : :
3124 : : /*
3125 : : * Create the tapeset from worker tapes, including a leader-owned tape at
3126 : : * the end. Parallel workers are far more expensive than logical tapes,
3127 : : * so the number of tapes allocated here should never be excessive.
3128 : : */
909 heikki.linnakangas@i 3129 :CBC 72 : inittapestate(state, nParticipants);
3130 : 72 : state->tapeset = LogicalTapeSetCreate(false, &shared->fileset, -1);
3131 : :
3132 : : /*
3133 : : * Set currentRun to reflect the number of runs we will merge (it's not
3134 : : * used for anything, this is just pro forma)
3135 : : */
2263 rhaas@postgresql.org 3136 : 72 : state->currentRun = nParticipants;
3137 : :
3138 : : /*
3139 : : * Initialize the state to look the same as after building the initial
3140 : : * runs.
3141 : : *
3142 : : * There will always be exactly 1 run per worker, and exactly one input
3143 : : * tape per run, because workers always output exactly 1 run, even when
3144 : : * there were no input tuples for workers to sort.
3145 : : */
909 heikki.linnakangas@i 3146 : 72 : state->inputTapes = NULL;
3147 : 72 : state->nInputTapes = 0;
3148 : 72 : state->nInputRuns = 0;
3149 : :
3150 : 72 : state->outputTapes = palloc0(nParticipants * sizeof(LogicalTape *));
3151 : 72 : state->nOutputTapes = nParticipants;
3152 : 72 : state->nOutputRuns = nParticipants;
3153 : :
3154 [ + + ]: 218 : for (j = 0; j < nParticipants; j++)
3155 : : {
3156 : 146 : state->outputTapes[j] = LogicalTapeImport(state->tapeset, j, &shared->tapes[j]);
3157 : : }
3158 : :
2263 rhaas@postgresql.org 3159 : 72 : state->status = TSS_BUILDRUNS;
3160 : 72 : }
3161 : :
3162 : : /*
3163 : : * Convenience routine to free a tuple previously loaded into sort memory
3164 : : */
3165 : : static void
6190 tgl@sss.pgh.pa.us 3166 : 1861419 : free_sort_tuple(Tuplesortstate *state, SortTuple *stup)
3167 : : {
1006 drowley@postgresql.o 3168 [ + + ]: 1861419 : if (stup->tuple)
3169 : : {
3170 : 1791799 : FREEMEM(state, GetMemoryChunkSpace(stup->tuple));
3171 : 1791799 : pfree(stup->tuple);
3172 : 1791799 : stup->tuple = NULL;
3173 : : }
6190 tgl@sss.pgh.pa.us 3174 : 1861419 : }
3175 : :
3176 : : int
743 john.naylor@postgres 3177 :LBC (1866124) : ssup_datum_unsigned_cmp(Datum x, Datum y, SortSupport ssup)
3178 : : {
3179 [ # # ]: (1866124) : if (x < y)
743 john.naylor@postgres 3180 :UBC 0 : return -1;
743 john.naylor@postgres 3181 [ # # ]:LBC (1866124) : else if (x > y)
743 john.naylor@postgres 3182 :UBC 0 : return 1;
3183 : : else
743 john.naylor@postgres 3184 :LBC (1866124) : return 0;
3185 : : }
3186 : :
3187 : : #if SIZEOF_DATUM >= 8
3188 : : int
743 john.naylor@postgres 3189 :CBC 591076 : ssup_datum_signed_cmp(Datum x, Datum y, SortSupport ssup)
3190 : : {
704 drowley@postgresql.o 3191 : 591076 : int64 xx = DatumGetInt64(x);
3192 : 591076 : int64 yy = DatumGetInt64(y);
3193 : :
743 john.naylor@postgres 3194 [ + + ]: 591076 : if (xx < yy)
3195 : 226808 : return -1;
3196 [ + + ]: 364268 : else if (xx > yy)
3197 : 181776 : return 1;
3198 : : else
3199 : 182492 : return 0;
3200 : : }
3201 : : #endif
3202 : :
3203 : : int
3204 : 99351881 : ssup_datum_int32_cmp(Datum x, Datum y, SortSupport ssup)
3205 : : {
704 drowley@postgresql.o 3206 : 99351881 : int32 xx = DatumGetInt32(x);
3207 : 99351881 : int32 yy = DatumGetInt32(y);
3208 : :
743 john.naylor@postgres 3209 [ + + ]: 99351881 : if (xx < yy)
3210 : 25686611 : return -1;
3211 [ + + ]: 73665270 : else if (xx > yy)
3212 : 24075540 : return 1;
3213 : : else
3214 : 49589730 : return 0;
3215 : : }
|