Commit Graph

  • b6264c6765 fix ulimit and antiword bugs Matt Wells 2014-06-18 04:06:20 -07:00
  • e36d9d1f3a turn off dup removal for all download queries now, not just bulk jobs. it is confusing ppl too much Matt Wells 2014-06-17 18:50:42 -07:00
  • 4d9bc7dc08 update mwells 2014-06-17 17:25:27 -06:00
  • c314e61968 make sectiondb stats just a special case of facets mwells 2014-06-17 16:39:02 -06:00
  • b2e9c4e631 package bldg updates mwells 2014-06-16 21:50:32 -06:00
  • 584af942d4 Merge branch 'testing' into diffbot-matt mwells 2014-06-16 20:42:28 -07:00
  • d71922168e facetize the sectiondb stuff mwells 2014-06-16 20:40:35 -07:00
  • 514f919b11 license fix Matt Wells 2014-06-16 13:52:51 -07:00
  • 29a6851100 push old fixes Matt Wells 2014-06-16 12:53:14 -07:00
  • 549f8eb5bc fix bug in hosts.conf when expanding working dir. Matt Wells 2014-06-16 11:32:10 -07:00
  • 9be18173db fix compiler bug Matt Wells 2014-06-16 11:10:38 -07:00
  • 153a5ab11d Merge branch 'diffbot-testing' into testing Matt Wells 2014-06-16 07:08:36 -07:00
  • 3b8741a7cb trying to prevent some cores. Matt Wells 2014-06-16 07:03:51 -07:00
  • 406fec356a fix json parser core from bad json. Matt Wells 2014-06-16 06:56:16 -07:00
  • 3c1f89e052 move parm order around in gb.conf Matt Wells 2014-06-16 06:07:11 -07:00
  • 9244e172d0 fix calls to antiword and pdftohtml etc. mwells 2014-06-15 17:44:52 -07:00
  • 928456b102 fix merge mwells 2014-06-15 15:05:28 -07:00
  • 2be7c78f2f Merge branch 'diffbot-testing' into testing mwells 2014-06-15 15:02:17 -07:00
  • 4e3e4fd0d0 yay! get multidoc flatfile injection working. mwells 2014-06-15 14:57:38 -07:00
  • 6f813ed7a5 fixed/added support for multi doc (flatfile) injection mwells 2014-06-15 09:54:08 -07:00
  • b2923acaf1 added support for using a delimiter with injections so one injected file can contain multiple documents. mwells 2014-06-15 09:10:00 -07:00
  • 7506d66d4a fixes for page inject mwells 2014-06-15 08:26:27 -07:00
  • c3bbcb9f92 parm-itize page reindex mwells 2014-06-15 07:56:27 -07:00
  • 2841569376 fix &roundStart=1 again to force a spider round for non-repeat collections. Matt Wells 2014-06-13 12:11:17 -07:00
  • 8e241297f2 integrated parm updates mwells 2014-06-13 11:07:01 -07:00
  • 5c0b371dc9 Merge branch 'testing' into diffbot-matt mwells 2014-06-13 11:00:09 -07:00
  • 308f2d07f7 fixes for section info injection into squid proxied responses mwells 2014-06-13 10:48:59 -07:00
  • 993804e6ab api table updates mwells 2014-06-13 09:37:53 -07:00
  • a123cad0d2 more page api updates mwells 2014-06-13 08:58:55 -07:00
  • 3cf3cddc5c beginning of total parm overhaul. new injection parms, just need to engage them. mwells 2014-06-12 21:27:06 -07:00
  • df8b9bd01a more fixes for section markup proxy mwells 2014-06-12 15:28:03 -07:00
  • 20c4ac4205 got it marking up html now with sectiondb stats. seems to work ok. mwells 2014-06-12 14:42:08 -07:00
  • ea90e7f755 more fixes for sectiondb markup code mwells 2014-06-12 13:05:45 -07:00
  • 76f1987785 fix roundstart bug Matt Wells 2014-06-12 07:52:30 -07:00
  • a425e181e7 api table parm cleanups. more to come mwells 2014-06-11 20:24:36 -07:00
  • 7ebaf531b0 inject url page was breaking rendering the new parms. mwells 2014-06-11 19:50:07 -07:00
  • ab7717d065 now use &roundStart=0 to trigger the next crawl round. now assume all crawl jobs are "repeat", but those that have repeat of "0" just assume 10 year frequency, 3652.5 days. that way the &roundStart=0 will do another round of crawling for them as well. Matt Wells 2014-06-11 18:45:58 -07:00
  • e4ce9bc9ac squidproxycache/floaters/sectiondbtagging all compiles. need to do run-time debugging now. mwells 2014-06-11 17:57:28 -07:00
  • 6f70282ba2 almost got sectiondb integration compiling mwells 2014-06-11 17:24:58 -07:00
  • 1e10c676d5 parm updates for injecting mwells 2014-06-11 17:24:33 -07:00
  • 66f8f3926d raise MAX_EXPRESSIONS Matt Wells 2014-06-10 19:32:46 -07:00
  • 27ffd23345 handle boolean query overflow errors better. Matt Wells 2014-06-10 17:21:55 -07:00
  • 365f29b293 made &spiderRoundStart=1 (or 0) force the next spider round to begin. also added pageUrl to XmlDoc::getContentHashJSON32() so it's not included in the hash to fix some spider-time deduping issues. Matt Wells 2014-06-10 14:20:41 -07:00
  • 108c281c33 fix annoying bug when adding new parms. mwells 2014-06-10 12:29:50 -07:00
  • 77241ecee0 fix make cygwin in Makefile mwells 2014-06-09 18:48:58 -07:00
  • 29e90d1d55 squid proxy fixes mwells 2014-06-09 16:10:24 -07:00
  • 5bf3042633 fix squid proxy cache key generation mwells 2014-06-09 14:37:13 -07:00
  • b71ea7f7c6 fixes for squid proxy simulator mwells 2014-06-09 14:31:48 -07:00
  • 4a2717a88f Merge branch 'diffbot-testing' into diffbot-matt mwells 2014-06-09 12:42:54 -07:00
  • 7d452a766c completed squid proxy simulation code mwells 2014-06-09 12:42:05 -07:00
  • 8968f094c0 ignore gbsortby:offerprice gbrevsortby:whatever query operators when evaluating boolean expressions. fix for '(title:fourth OR text:water) gbsortby:offerPrice' query Matt Wells 2014-06-09 11:00:27 -07:00
  • bc6c6b3ab7 Merge branch 'testing' into diffbot-testing Matt Wells 2014-06-09 10:18:25 -07:00
  • 56af753c3e fixed nasty bug of resetting RdbBases for random collnums, causing data loss and corruption. Matt Wells 2014-06-09 10:16:29 -07:00
  • 778e67130f File::set() fix for //'s mwells 2014-06-08 15:24:30 -07:00
  • 81fed12705 minor makefile updates mwells 2014-06-08 11:49:26 -07:00
  • 6fddeb416a fixes for 'make debian-testing' package building code for ubuntu/debian mwells 2014-06-08 11:35:39 -07:00
  • 01bfebaaaf admin.html updates mwells 2014-06-07 19:45:56 -07:00
  • c713454318 admin.html updates mwells 2014-06-07 18:50:24 -07:00
  • 9067013425 cygwin fixes mwells 2014-06-07 16:30:56 -07:00
  • 4e5cf747dc cygwin fixes mwells 2014-06-07 16:29:39 -07:00
  • e5cb5ab907 cygwin cleanups mwells 2014-06-07 15:59:32 -07:00
  • c07996d700 cygwin updates mwells 2014-06-07 14:58:57 -07:00
  • 27cc896a6c DEFS2 to Makefile mwells 2014-06-07 14:48:19 -07:00
  • f55d7cfd68 CYGWIN updates mwells 2014-06-07 14:39:48 -07:00
  • 778430a543 cygwin updates mwells 2014-06-07 14:37:21 -07:00
  • d57ce8a2df simplify compilation more. remove clones() mwells 2014-06-07 14:26:11 -07:00
  • 1553663d82 compiler cleanups for cygwin compile mwells 2014-06-07 14:20:04 -07:00
  • 628fe2336f make code compile cleaner. mwells 2014-06-07 14:11:12 -07:00
  • 04c5d78efe updated email. mwells 2014-06-07 11:20:01 -07:00
  • de3f51d30f add ubuntu package link in admin.html. mwells 2014-06-07 10:47:23 -07:00
  • 4a4fccfd93 added 'make testing-deb' support to build debian packages. mwells 2014-06-07 10:21:51 -07:00
  • a809c99abb email update Matt Wells 2014-06-06 19:31:24 -07:00
  • d16a1f3422 Merge branch 'diffbot-testing' into testing Matt Wells 2014-06-06 19:22:52 -07:00
  • 3b2ed3bdb4 fix compile issues on some machines by including the bits/ include subdir directly in the repo. also added -Ibits to the Makefile. Matt Wells 2014-06-06 19:05:01 -07:00
  • b9777d3f55 fix domain only bug in serps Matt Wells 2014-06-06 18:00:18 -07:00
  • 0fd85b788b halfway done coding up proxy (squid) support into gb mwells 2014-06-06 17:27:18 -07:00
  • 72df0d25d2 added safebuf base64decode func mwells 2014-06-06 16:20:15 -07:00
  • 08e42a64cb any /admin/ cmd as well should not trunc posts Matt Wells 2014-06-06 16:14:47 -07:00
  • c7c39005f1 do not truncate crawl jobs POSTs Matt Wells 2014-06-06 16:08:37 -07:00
  • 965d992f98 Merge branch 'diffbot-testing' into diffbot-matt mwells 2014-06-06 15:14:41 -07:00
  • 3f2dcda4e1 got new floater/proxy logic compiling. mwells 2014-06-06 15:11:51 -07:00
  • 6b5b83ac85 fixes for gbmin/gbmax being first query term. Matt Wells 2014-06-06 10:20:12 -07:00
  • 5b6100c77d log format change for errcnt Matt Wells 2014-06-06 09:29:57 -07:00
  • d850f5f006 try to prevent job status flip flop from error retries. Matt Wells 2014-06-05 23:38:54 -07:00
  • 74f0a41290 bulk jobs give up after downloading a url 3 times. crawls don't give up on tmperrors, but retry every 30 days. Matt Wells 2014-06-05 23:11:14 -07:00
  • 172d7071a7 fix to rename tagdb0000.002.dat Matt Wells 2014-06-05 22:21:41 -07:00
  • 8ac691f324 fix merging getting clogged by so many collections trying to merge tagdb at once Matt Wells 2014-06-05 21:27:33 -07:00
  • 13243a411c more fixes for fake http reply hack Matt Wells 2014-06-05 20:31:49 -07:00
  • ce7294e9a9 more mem leak fixes for fake bulk job empty http replies Matt Wells 2014-06-05 20:09:12 -07:00
  • 7f10fca234 no longer force add www to url domain if it is just a domain. was messing up tmblr.co where www.tmblr.co has no IP. Matt Wells 2014-06-05 17:00:12 -07:00
  • 3c6a8bf87e fix issue of not retrying diffbot internal errors. Matt Wells 2014-06-05 16:24:52 -07:00
  • cfda735194 print error stuff in spiderdb dump. Matt Wells 2014-06-05 16:14:32 -07:00
  • 970eb33a83 sanity checks to ensure fakefirstip was able to convert to a real good firstip Matt Wells 2014-06-05 16:13:33 -07:00
  • 7b4b8b27bd more debug msgs Matt Wells 2014-06-05 14:58:20 -07:00
  • 5f41840211 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2014-06-05 14:48:23 -07:00
  • 8477ef72f8 support gbmin gbmax gbminint gbmaxint range query terms properly, when generating the docidvotebuf. fixes boolean queries using them as well. Matt Wells 2014-06-05 14:47:45 -07:00
  • 1fe2c94322 add some debug notes Matt Wells 2014-06-05 12:26:06 -07:00
  • 780fd43aae timestamp bug fix Matt Wells 2014-06-04 15:50:26 -07:00
  • 2a36e5bde5 Merge branch 'diffbot-testing' into diffbot-matt mwells 2014-06-04 14:40:34 -07:00
  • 546d135007 fix boolean queries to do the on-demand mini merges of the termlists. should fix gbmin:offerprice:100 AND (text:lord OR text:helicopter) Matt Wells 2014-06-04 14:33:54 -07:00